NONLINEAR  EXTENSIONS  TO  THE 
MINIMUM  AVERAGE  CORRELATION  ENERGY  FILTER 


By 

JOHN  W.  FISHER  III 


A  DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL  OF 

THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT  OF 

THE  REQUIREMENTS  FOR  THE  DEGREE  OF  DOCTOR  OF 

PHILOSOPHY 

UNIVERSITY  OF  FLORIDA 

1997 


ACKNOWLEDGEMENTS 

There  are  many  people  I  would  like  to  acknowledge  for  their  help  in  the  genesis  of  this 
manuscript.  I  would  begin  with  my  family  for  their  constant  encouragement  and  support. 

I  am  grateful  to  the  Electronic  Communications  Laboratory  and  the  Army  Research 
Laboratory for their support of the research at the ECL. I was fortunate to work with very
talented  people,  Marion  Bartlett,  Jim  Bevington,  and  Jim  Kurtz,  in  the  areas  of  ATR  and 
coherent  radar  systems.  In  particular,  I  cannot  overstate  the  influence  that  Marion  Bartlett 
has  had  on  my  perspective  of  engineering  problems.  I  would  also  like  to  thank  Jeff  Sichina 
of  the  Army  Research  Laboratory  for  providing  many  interesting  problems,  perhaps  too 
interesting, in the field of radar and ATR. A large part of who I am technically has been
shaped  by  these  people. 

I  would,  of  course,  like  to  acknowledge  my  advisor,  Dr.  Jose  Principe,  for  providing  me 
with  an  invaluable  environment  for  the  study  of  nonlinear  systems  and  excellent  guidance 
throughout  the  development  of  this  thesis.  His  influence  will  leave  a  lasting  impression  on 
me. I would also like to thank DARPA, whose funding enabled a great deal of
the  research  that  went  into  this  thesis.  I  would  also  like  to  thank  Drs.  David  Casasent  and 
Paul  Viola  for  taking  an  interest  in  my  work  and  offering  helpful  advice. 

I would also like to thank the students, past and present, of the Computational
NeuroEngineering Laboratory. The list includes, but is not limited to, Chuan Wang for useful
discussions on information theory; Neil Euliano for providing much needed recreational
opportunities and intramural championship t-shirts; and Andy Mitchell for being a good friend
to go to lunch with, who suffered long inane technical discussions and who is now a
better climber than me. There are certainly others and I am grateful to all.

Finally  I  would  like  to  thank  my  wife,  Anita,  for  enduring  a  seemingly  endless  ordeal, 
for  allowing  me  to  use  every  ounce  of  her  patience,  and  for  sacrificing  some  of  her  best 
years so that I could finish this Ph.D. I hope it has been worth it.


TABLE  OF  CONTENTS 


ACKNOWLEDGEMENTS ii 

LIST  OF  FIGURES v 

LIST  OF  TABLES viii 

ABSTRACT ix 

CHAPTERS 

1  INTRODUCTION 1 

1.1  Motivation 1

2  BACKGROUND 6 

2.1  Discussion  of  Distortion  Invariant  Filters 6 

2.1.1  Synthetic  Discriminant  Function 12 

2.1.2  Minimum  Variance  Synthetic  Discriminant  Function 15 

2.1.3  Minimum  Average  Correlation  Energy  Filter 18 

2.1.4  Optimal  Trade-off  Synthetic  Discriminant  Function 20 

2.2  Pre-processor/SDF  Decomposition 24 

3  THE  MACE  FILTER  AS  AN  ASSOCIATIVE  MEMORY 27 

3.1  Linear  Systems  as  Classifiers 27 

3.2  MSE  Criterion  as  a  Proxy  for  Classification  Performance 29 

3.2.1  Unrestricted Functional Mappings 30

3.2.2  Parameterized  Functional  Mappings 32 

3.2.3  Finite  Data  Sets 34 

3.3  Derivation  of  the  MACE  Filter 35 

3.3.1  Pre-processor/SDF  Decomposition 38 

3.4  Associative  Memory  Perspective 39 

3.5  Comments 49 

4  STOCHASTIC APPROACH TO TRAINING NONLINEAR SYNTHETIC DISCRIMINANT FUNCTIONS 52

4.1  Nonlinear Iterative Approach 52

4.2  A  Proposed  Nonlinear  Architecture 53 

4.2.1  Shift  Invariance  of  the  Proposed  Nonlinear  Architecture 55 

4.3  Classifier  Performance  and  Measures  of  Generalization 57 

4.4  Statistical  Characterization  of  the  Rejection  Class 67 

4.4.1  The  Linear  Solution  as  a  Special  Case 69 

4.4.2  Nonlinear  Mappings 70 



4.5  Efficient  Representation  of  the  Rejection  Class 72 

4.6  Experimental  Results 74 

4.6.1  Experiment I - noise training 75

4.6.2  Experiment II - noise training with an orthogonalization constraint 81

4.6.3  Experiment III - subspace noise training 84

4.6.4  Experiment  IV  -  convex  hull  approach 89 

5  INFORMATION-THEORETIC  FEATURE  EXTRACTION 96 

5.1  Introduction 96

5.2  Motivation  for  Feature  Extraction 97 

5.3  Information  Theoretic  Background 101 

5.3.1  Mutual  Information  as  a  Self-Organizing  Principle 101 

5.3.2  Mutual  Information  as  a  Criterion  for  Feature  Extraction 104 

5.3.3  Prior  Work  in  Information  Theoretic  Neural  Processing 106 

5.3.4  Nonparametric  PDF  Estimation 108 

5.4  Derivation  Of  The  Learning  Algorithm 110 

5.5  Gaussian  Kernels 115 

5.6  Maximum Entropy/PCA: An Empirical Comparison 118

5.7  Maximum  Entropy:  ISAR  Experiment 124 

5.7.1  Maximum  Entropy:  Single  Vehicle  Class 125 

5.7.2  Maximum  Entropy:  Two  Vehicle  Classes 127 

5.8  Computational  Simplification  of  the  Algorithm 127 

5.9  Conversion  of  Implicit  Error  Direction  to  an  Explicit  Error 136 

5.9.1  Entropy  Minimization  as  Attraction  to  a  Point 136 

5.9.2  Entropy  Maximization  as  Diffusion 139 

5.9.3  Stopping  Criterion 141 

5.10  Observations 143 

5.11  Mutual  Information  Applied  to  the  Nonlinear  MACE  Filters 144 

6  CONCLUSIONS 151 

APPENDIX 

A       DERIVATIONS 155 

REFERENCES 168 

BIOGRAPHICAL  SKETCH 173 


LIST  OF  FIGURES 


Figure

1  ISAR images of two vehicle types 9

2  MSF peak output response of training vehicle 1a over all aspect angles 10

3  MSF peak output response of testing vehicles 1b and 2a over all aspect angles 11

4  MSF output image plane response 12

5  SDF peak output response of training vehicle 1a over all aspect angles 15

6  SDF peak output response of testing vehicles 1b and 2a over all aspect angles 16

7  SDF output image plane response 17

8  MACE filter output image plane response 20

9  MACE peak output response of vehicle 1a, 1b and 2a over all aspect angles 21

10  Example of a typical OTSDF performance plot 23

11  OTSDF filter output image plane response 24

12  OTSDF peak output response of vehicle 1a over all aspect angles 25

13  OTSDF peak output response of vehicles 1b and 2a over all aspect angles 26

14  Decomposition  of  distortion  invariant  filter  in  space  domain 26 

15  Adaline  architecture 28 

16  Decomposition of MACE filter as a preprocessor (i.e. a pre-whitening filter over
the average power spectrum of the exemplars) followed by a synthetic discriminant
function 39

17  Decomposition  of  MACE  filter  as  a  preprocessor  (i.e.  a  pre-whitening  filter  over 
the  average  power  spectrum  of  the  exemplars)  followed  by  a  linear  associative 
memory 43 

18  Peak output response over all aspects of vehicle 1a when the data matrix is
not full rank 47

19  Output correlation surface for LMS computed filter from non full rank data 48

20  Learning curve for LMS approach 49

21  NMSE between closed form solution and iterative solution 50

22  Decomposition  of  optimized  correlator  as  a  pre-processor  followed  by  SDF/LAM 
(top).  Nonlinear  variation  shown  with  MLP  replacing  SDF  in  signal  flow  (middle), 
detail  of  the  MLP  (bottom).  The  linear  transformation  represents  the  space  domain 
equivalent  of  the  spectral  pre-processor 54 

23  ISAR  images  of  two  vehicle  types  shown  at  aspect  angles  of  5,  45,  and  85  degrees 
respectively 59 



24  Generalization  as  measured  by  the  minimum  peak  response 62 

25  Generalization  as  measured  by  the  peak  response  mean  square  error. 63 

26  Comparison  of  ROC  curves 64 

27  ROC  performance  measures  versus 66 

28  Peak  output  response  of  linear  and  nonlinear  filters  over  the  training  set 77 

29  Output  response  of  linear  filter  (top)  and  nonlinear  filter  (bottom) 78 

30  ROC curves for linear filter (solid line) versus nonlinear filter (dashed line) 79

31  Experiment I: Resulting feature space from simple noise training 80

32  Experiment  II:  Resulting  feature  space  when  orthogonality  is  imposed  on  the  input 
layer  of  the  MLP.  83 

33  Experiment  II:  Resulting  ROC  curve  with  orthogonality  constraint 84 

34  Experiment  II:  Output  response  to  an  image  from  the  recognition  class  training 
set 85 

35  Experiment III: Resulting feature space when the subspace noise is used for
training 88

36  Experiment III: Resulting ROC curve for subspace noise training 89

37  Experiment  III:  Output  response  to  an  image  from  the  recognition  class  training 
set 90 

38  Learning  curves  for  three  methods 90 

39  Experiment  IV:  resulting  feature  space  from  convex  hull  training 94 

40  Experiment  IV:  Resulting  ROC  curve  with  convex  hull  approach 95 

41  Classical  pattern  classification  decomposition 100 

42  Decomposition of NL-MACE as a cascade of feature extraction followed by
discrimination 100

43  Mutual information approach to feature extraction 106

44  Mapping as feature extraction. Information content is measured in the low
dimensional space of the observed output 108

45  A  signal  flow  diagram  of  the  learning  algorithm 114 

46  Gradient  of  two-dimensional  gaussian  kernel.  The  kernels  act  as  attractors  to  low 
points  in  the  observed  PDF  on  the  data  when  entropy  maximization  is  desired.  117 

47  Mixture of gaussians example 118

48  Mixture  of  gaussians  example,  entropy  minimization  and  maximization 119 

49  PCA  vs.  Entropy  -  gaussian  case 120 

50  PCA  vs.  Entropy  -  non-gaussian  case 122 

51  PCA  vs.  Entropy  -  non-gaussian  case 123 


52  Example  ISAR  images  from  two  vehicles  used  for  experiments 124 

53  Single  vehicle  experiment,  100  iterations 125 

54  Single  vehicle  experiment,  200  iterations 126 

55  Single  vehicle  experiment,  300  iterations 126 

56  Two  vehicle  experiment 128 

57  Two  dimensional  attractor  functions 133 

58  Two  dimensional  regulating  function 134 

59  Magnitude  of  the  regulating  function 134 

60  Approximation  of  the  regulating  function 135 

61  Feedback  functions  for  implicit  error  term 138 

62  Entropy  minimization  as  local  attraction 140 

63  Entropy  maximization  as  diffusion 142 

64  Stopping  criterion 143 

65  Mutual  information  feature  space 146 

66  ROC  curves  for  mutual  information  feature  extraction  (dotted  line)  versus  linear 
MACE  filter  (solid  line) 148 

67  Mutual  information  feature  space  resulting  from  convex  hull  exemplars 149 

68  ROC  curves  for  mutual  information  feature  extraction  (dotted  line)  versus  linear 
MACE  filter  (solid  line) 150 


LIST  OF  TABLES 


Table

1  Classifier performance measures when the filter is determined by either of the
common measures of generalization as compared to best classifier performance for
two values of 61

2  Correlation of generalization measures to classifier performance. In both cases (
equal to 0.5 or 0.95) the classifier performance as measured by the area of the ROC
curve or Pfa at Pd equal 0.8, has an opposite correlation as to what would be
expected of a useful measure for predicting performance 64

3  Comparison of ROC classifier performance for two values of Pd. Results are shown
for the linear filter versus four different types of nonlinear training. N: white noise
training, G-S: Gram-Schmidt orthogonalization, subN: PCA subspace noise, C-H:
convex hull rejection class 81

4  Comparison of ROC classifier performance for two values of Pd. Results are shown
for the linear filter versus experiments III and IV from section 4.6 and mutual
information feature extraction. The symbols indicate the type of rejection class
exemplars used. N: white noise training, G-S: Gram-Schmidt orthogonalization,
subN: PCA subspace noise, C-H: convex hull rejection class 145


Abstract  of  Dissertation  Presented  to  the  Graduate  School 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 


NONLINEAR  EXTENSIONS  TO  THE  MINIMUM  AVERAGE 
CORRELATION  ENERGY  FILTER 


By 

John W. Fisher III
May  1997 


Chairman:  Dr.  Jose  C.  Principe 

Major  Department:  Electrical  and  Computer  Engineering 


The  major  goal  of  this  research  is  to  develop  efficient  methods  by  which  the  family  of 
distortion  invariant  filters,  specifically  the  minimum  average  correlation  energy  (MACE) 
filter,  can  be  extended  to  a  general  nonlinear  signal  processing  framework.  The  primary 
application  of  MACE  filters  has  been  to  pattern  classification  of  images.  Two  desirable 
qualities of MACE-type correlators are ease of implementation via correlation and
analytic computation of the filter coefficients.

Our motivation for exploring nonlinear extensions to these filters is due to the
well-known limitations of the linear systems approach to classification. Among these
limitations is the attempt to solve the classification problem in a signal representation space,
whereas the classification problem is more properly solved in a decision or probability
space. An additional limitation of the MACE filter is that it can only be used to realize a
linear decision surface regardless of the means by which it is computed. These limitations
lead to suboptimal classification and discrimination performance.


Extension  to  nonlinear  signal  processing  is  not  without  cost.  Solutions  must  in  general 
be  computed  iteratively.  Our  approach  was  motivated  by  the  early  proof  that  the  MACE 
filter is equivalent to the linear associative memory (LAM). The associative memory
perspective is more properly associated with the classification problem and has been
developed extensively in an iterative framework.

In  this  thesis  we  demonstrate  a  method  emphasizing  a  statistical  perspective  of  the 
MACE filter optimization criterion. Through the statistical perspective, efficient methods
of representing the rejection and recognition classes are derived. This, in turn, enables a
machine learning approach and the synthesis of more powerful nonlinear discriminant
functions which maintain the desirable properties of the linear MACE filter, namely,
localized detection and shift invariance.

We  also  present  a  new  information  theoretic  approach  to  training  in  a  self-organized  or 
supervised  manner.  Information  theoretic  signal  processing  looks  beyond  the  second  order 
statistical characterization inherent in the linear systems approach. The information
theoretic framework probes the probability space of the signal under analysis. This technique
has  wide  application  beyond  nonlinear  MACE  filter  techniques  and  represents  a  powerful 
new  advance  to  the  area  of  information  theoretic  signal  processing. 

Empirical results, comparing the classical linear methodology to the nonlinear
extensions, are presented using inverse synthetic aperture radar (ISAR) imagery. The results
demonstrate  the  superior  classification  performance  of  the  nonlinear  MACE  filter. 


CHAPTER  1 
INTRODUCTION 

1.1  Motivation
Automatic  target  detection  and  recognition  (ATD/R)  is  a  field  of  pattern  recognition. 
The  goal  of  an  ATD/R  system  is  to  quickly  and  automatically  detect  and  classify  objects 
which  may  be  present  within  large  amounts  of  data  (typically  imagery)  with  a  minimum  of 
human intervention. In an ATD/R system, it is not only desirable to recognize various
targets, but to locate them with some degree of accuracy. The minimum average correlation
energy (MACE) filter [Mahalanobis et al., 1987] is of interest to the ATD/R problem due
to its localization and discrimination properties. The MACE filter is a member of a family
of correlation filters derived from the synthetic discriminant function (SDF) [Hester and
Casasent, 1980]. The SDF and its variants have been widely applied to the ATD/R
problem. We will describe synthetic discriminant functions in more detail in chapter 2. Other
generalizations of the SDF include the minimum variance synthetic discriminant function
(MVSDF) [Kumar, 1986], the MACE filter, and more recently the gaussian minimum
average correlation energy (G-MACE) [Casasent et al., 1991] and the minimum noise and
correlation energy (MINACE) [Ravichandran and Casasent, 1992] filters.

This  area  of  filter  design  is  commonly  referred  to  as  distortion-invariant  filtering.  It  is  a 
generalization of matched spatial filtering for the detection of a single object to the
detection of a class of objects, usually in the image domain. Typically the object class is
represented by a set of exemplars. The exemplar images represent the image class through a
range of "distortions" such as a variation in viewing aspect of a single object. The goal is
to design a single filter which will recognize an object class through the entire range of
distortion. Under the design criterion the filter is equally matched to the entire range of
distortion as opposed to a single viewpoint as in a matched filter. Hence the nomenclature
distortion-invariant filtering [Kumar, 1992].

The bulk of the research using these types of filters has focused on optical and
infra-red (IR) imagery and overcoming recognition problems in the presence of distortions
associated with 3-D to 2-D mappings, e.g. scale and rotation (in-plane and out-of-plane).
Recently, however, this technique has been applied to radar imagery [Novak et al., 1994;
Fisher and Principe, 1995a; Chiang et al., 1995]. In contrast to optical or infra-red
imagery, the scale of each pixel within a radar image is usually constant and known.
Consequently, radar imagery does not suffer from scale distortions of objects.

In the family of distortion invariant filters, the MACE filter has been shown to possess
superior  discrimination  properties  [Mahalanobis  et  al.,  1987,  Casasent  and  Ravichandran, 
1992].  It  is  for  this  reason  that  this  work  emphasizes  nonlinear  extensions  to  the  MACE 
filter.  The  MACE  filter  and  its  variants  are  designed  to  produce  a  narrow,  constrained- 
amplitude peak response when the filter mask is centered on a target in the recognition
class  while  minimizing  the  energy  in  the  rest  of  the  output  plane.  This  property  provides 
desirable  localization  for  detection.  Another  property  of  the  MACE  filter  is  that  it  is  less 
susceptible  to  out-of-class  false  alarms  [Mahalanobis  et  al.,  1987].  While  the  focus  of  this 
work will be on the MACE filter criterion, it should be stated that all of the results
presented here are equally applicable to any of the distortion invariant filters mentioned above
with  appropriate  changes  to  the  respective  optimization  criteria. 



Although the MACE filter does have superior false alarm properties, it also has some
fundamental limitations. Since it is a linear filter, it can only be used to realize linear
decision surfaces. It has also been shown to be limited in its ability to generalize to exemplars
that are in the recognition class (but not in the training set), while simultaneously rejecting
out-of-class inputs [Casasent and Ravichandran, 1992; Casasent et al., 1991]. The number
of design exemplars can be increased in order to overcome generalization problems;
however, the calculation of the filter coefficients becomes computationally prohibitive and
numerically unstable as the number of design exemplars is increased [Kumar, 1992]. The
MINACE  and  G-MACE  variations  have  improved  generalization  properties  with  a  slight 
degradation  in  the  average  output  plane  variance  [Ravichandran  and  Casasent,  1992]  and 
sharpness  of  the  central  peak  [Casasent  et  al.,  1991],  respectively. 

This  research  presents  a  basis  by  which  the  MACE  filter,  and  by  extension  all  linear 
distortion  invariant  filters,  can  be  extended  to  a  more  general  nonlinear  signal  processing 
framework.  In  the  development  it  is  shown  that  the  performance  of  the  linear  MACE  filter 
can be improved upon in terms of generalization while maintaining its desirable
properties, i.e. a sharp, constrained peak at the center of the output plane.

A  more  detailed  description  of  the  developmental  progression  of  distortion  invariant 
filtering is given in chapter 2. In that chapter a qualitative comparison of the various
distortion invariant filters is presented using inverse synthetic aperture radar (ISAR) imagery.
The application of pattern recognition techniques to high-resolution radar imagery has
become a topic of great interest recently with the advent of widely available
instrumentation grade imaging radars. High resolution radar imagery poses a special challenge to
distortion invariant filtering in that sources of distortion, such as rotation in aspect of an
object, do not manifest themselves as rotations within the radar image (as opposed to
optical imagery). In this case the distortion is not purely geometric, but more abstract.

Chapter 3 presents a derivation of the MACE filter as a special case of Kohonen's
linear associative memory [Kohonen, 1988]. This relationship is important in that the associative
memory perspective is the starting point for developing nonlinear extensions to the MACE
filter. 

In chapter 4 the basis upon which the MACE filter can be extended to nonlinear adaptive
systems is developed. In this chapter a nonlinear architecture is proposed for the extension
of the MACE filter. A statistical perspective of the MACE filter is discussed which leads
naturally into a class representational viewpoint of the optimization criterion of distortion
invariant filters. Commonly used measures of generalization for distortion invariant
filtering are also discussed. The results of the experiments presented show that the measures
are not appropriate for the task of classification. It is interesting to note that the analysis
indicates the appropriateness of the measures is independent of whether the mapping is
linear or nonlinear. The analysis also discusses the merit of the MACE filter optimization
criterion in the context of classification and with regard to measures of generalization.
The  chapter  concludes  with  a  series  of  experiments  further  refining  the  techniques  by 
which  nonlinear  MACE  filters  are  computed. 

Chapter  5  presents  a  new  information  theoretic  method  for  feature  extraction.  An 
information theoretic approach is motivated by the observation that the optimization
criterion of the MACE filter only considers the second-order statistics of the rejection class.
The information theoretic approach, however, operates in probability space, exploiting
properties of the underlying probability density function. The method enables the
extraction of statistically independent features. The method has wide application beyond
nonlinear extensions to MACE filters and as such represents a powerful new technique for
information theoretic signal processing. A review of information theoretic approaches to
signal processing is presented in this chapter. This is followed by the derivation of the
new  technique  as  well  as  some  general  experimental  results  which  are  not  specifically 
related  to  nonlinear  MACE  filters,  but  which  serve  to  illustrate  the  potential  of  this 
method.  Finally  the  logical  placement  of  this  method  within  nonlinear  MACE  filters  is 
presented  along  with  experimental  results. 

In  chapter  6  we  review  the  significant  results  and  contributions  of  this  dissertation.  We 
also  discuss  possible  lines  of  research  resulting  from  the  base  established  here. 


CHAPTER  2 
BACKGROUND 

2.1  Discussion of Distortion Invariant Filters
As  stated,  distortion  invariant  filtering  is  a  generalization  of  matched  spatial  filtering. 
It  is  well  known  that  the  matched  filter  maximizes  the  peak-signal-to-average-noise  power 
ratio  as  measured  at  the  filter  output  at  a  specific  sample  location  when  the  input  signal  is 
corrupted  by  additive  white  noise. 

In the discrete signal case the design of a matched filter is equivalent to the following
vector optimization problem [Kumar, 1986]:

$$\min_{h} \; h^\dagger h \quad \text{s.t.} \quad x^\dagger h = d, \qquad \{h, x\} \in \mathbb{C}^{N \times 1},$$

where the column vector x contains the N coefficients of the signal we wish to detect, h
contains the coefficients of the filter ($\dagger$ indicates the hermitian transpose operator), and d
is a positive scalar. This notation is also suitable for N-dimensional signal processing as
long as the signal and filter have finite support and are re-ordered in the same lexicographic
manner (e.g. by row or column in the two-dimensional case) into column vectors.
The optimal solution to this problem is

$$h = x (x^\dagger x)^{-1} d.$$


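Given this solution we can calculate the peak output signal power as

$$|x^\dagger h|^2 = \left| x^\dagger x (x^\dagger x)^{-1} d \right|^2 = d^2$$

and the average output noise power due to an additive white noise input as

$$\sigma^2 h^\dagger h = \sigma^2 d^2 (x^\dagger x)^{-1},$$

where $\sigma^2$ is the input noise variance. This results in a peak-signal-to-average-noise
output power ratio of

$$\left(\frac{S}{N}\right) = \frac{d^2}{\sigma^2 d^2 (x^\dagger x)^{-1}} = \frac{x^\dagger x}{\sigma^2}.$$

As we can see, the result is independent of the choice of scalar, d. If d is set to unity,
the result is a normalized matched spatial filter [Vander Lugt, 1964].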
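The matched filter solution can be checked numerically. The following is a minimal sketch rather than code from this work: it assumes NumPy, uses a random array as a hypothetical stand-in for a two-dimensional signal, and the names `matched_filter` and `output_snr` are illustrative only. It forms the filter from the lexicographically reordered signal and evaluates the output power ratio for two choices of d.

```python
import numpy as np

def matched_filter(x, d=1.0):
    """Minimum-norm filter h = x (x^H x)^{-1} d satisfying x^H h = d."""
    # Row-wise (lexicographic) reordering of the signal into a vector.
    x = np.asarray(x, dtype=complex).ravel()
    energy = np.vdot(x, x).real            # x^H x
    return x * (d / energy)

def output_snr(x, h, noise_var):
    """Peak signal power divided by average white-noise output power."""
    x = np.asarray(x, dtype=complex).ravel()
    peak_power = abs(np.vdot(x, h)) ** 2           # |x^H h|^2
    noise_power = noise_var * np.vdot(h, h).real   # sigma^2 h^H h
    return peak_power / noise_power

# Hypothetical test signal and noise variance (assumptions for illustration).
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
sigma2 = 0.5

h1 = matched_filter(image, d=1.0)
h3 = matched_filter(image, d=3.0)
snr1 = output_snr(image, h1, sigma2)
snr3 = output_snr(image, h3, sigma2)
expected = np.sum(image ** 2) / sigma2    # x^H x / sigma^2

print(np.vdot(image.ravel(), h3).real)    # constraint value x^H h for d = 3
print(snr1, snr3, expected)
```

Under these assumptions both filters satisfy their constraints and yield the same ratio, x†x divided by the noise variance, which illustrates why d can be set to unity without loss.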

In  order  to  further  motivate  the  concept  of  distortion  invariant  filtering,  a  typical  ATR 
example  problem  will  be  used  for  illustration.  This  experiment  will  also  help  to  illustrate 
the  genesis  of  the  various  types  of  distortion  invariant  filtering  approaches  beginning  with 
the  matched  spatial  filter  (MSF). 

Inverse synthetic aperture radar (ISAR) imagery will be used for all of the experiments
presented herein. Distortion invariant filtering, however, is not limited to ISAR imagery
and in fact can be extended to much more abstract data types. ISAR images are shown
in figure 1. In the figure, three vehicles are displayed, each at three different radar viewing
aspect angles (5, 45, and 85 degrees), where the aspect angle is the direction of the front of
the vehicle relative to the radar antenna. The image dimensions are 64 x 64 pixels. Radar
systems measure a quantity called radar cross section (RCS). When a radar transmits an
electromagnetic pulse, some of the incident energy on an object is reflected back to the
radar. RCS is a measure of the reflected energy detected by the radar's receiving antenna.

ISAR imagery is the result of a radar signal processing technique which uses multiple
detected radar returns measured over a range of relative object aspect angles. Each pixel in
an ISAR image is a measure of the aggregate radar cross section at regularly sampled
points in space.

Two types of vehicles are shown. Vehicle type 1 will represent a recognition class,
while vehicle type 2 will represent a confusion class. The goal is to compute a filter which
will recognize vehicle type 1 without being confused by vehicle type 2. Images of vehicle 1a
will be used to compute the filter coefficients. Vehicles 1b and 2a represent an independent
testing class.

ISAR images of all three vehicles were formed in the aspect range of 5 to 85 degrees at
1 degree increments. As the MSF is derived from a single vehicle image, an image of vehicle
1a at 45 degrees (the midpoint of the aspect range) is used.

Figure 1.  ISAR images of two vehicle types. Vehicles are shown at aspect angles
of 5, 45, and 85 degrees respectively. Two different vehicles of type 1 (a
and b) are shown, while one vehicle of type 2 (a) is shown. Vehicle 1a
is used as a training vehicle, while vehicle 1b is used as the testing
vehicle for the recognition class. Vehicle 2a represents a confusion
vehicle.

The peak output response to an image represents the maximum of the cross correlation
function of the image with the MSF template. The peak output response over the entire
aspect range of vehicle 1a is shown in figure 2. As can be seen in the figure, the filter
matches at 45 degrees very well; however, as the aspect moves away from 45 degrees, the
peak output response begins to degrade. Depending on the type of imagery as well as the
vehicle, this degradation can become very severe.
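The peak output response described here can be sketched in a few lines. This is an illustrative example under stated assumptions, not the procedure used for the experiments: it assumes NumPy, computes a circular cross correlation through the FFT (a linear correlation would require zero padding), and uses a random array as a hypothetical stand-in for an ISAR exemplar. The names `correlation_plane` and `peak_response` are hypothetical.

```python
import numpy as np

def correlation_plane(image, template):
    """Circular cross correlation of image with template via the FFT."""
    f_img = np.fft.fft2(image)
    f_tpl = np.fft.fft2(template, s=image.shape)
    # Correlation corresponds to multiplying by the conjugate spectrum.
    return np.real(np.fft.ifft2(f_img * np.conj(f_tpl)))

def peak_response(image, template):
    """Return the maximum of the correlation plane and its location."""
    plane = correlation_plane(image, template)
    idx = np.unravel_index(np.argmax(plane), plane.shape)
    return plane[idx], tuple(int(i) for i in idx)

# Hypothetical exemplar standing in for a single training image.
rng = np.random.default_rng(1)
exemplar = rng.standard_normal((16, 16))

# Normalized matched spatial filter (d = 1): unit response to the exemplar.
template = exemplar / np.sum(exemplar ** 2)

# A circularly shifted copy of the exemplar serves as the test image.
shifted = np.roll(exemplar, shift=(3, 5), axis=(0, 1))
peak, location = peak_response(shifted, template)
print(peak, location)
```

In this sketch the peak value is the quantity plotted against aspect angle, and the peak location indicates where in the image the template matched.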


[Plot: "matched spatial filter"; horizontal axis: aspect angle]

Figure 2.  MSF peak output response of training vehicle 1a over all aspect angles.
Peak response degrades as aspect difference increases.

The peak output responses of both vehicles in the testing set are shown in figure 3
overlaid on the training image response. In one sense the filter exhibits good
generalization; that is, the peak response to vehicle 1b is much the same as a function of
aspect as the peak response to vehicle 1a. However, the filter also "generalizes" equally
well to vehicle 2a, which is undesirable. As a vehicle discrimination test (vehicle 1 from
vehicle 2) the MSF fails.
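
The peak output response used in these comparisons can be sketched numerically. The following is a minimal illustration, assuming circular (FFT-based) correlation and synthetic random arrays standing in for ISAR images; the function name `peak_correlation_response` and all data are hypothetical and are not taken from the experiments above.

```python
import numpy as np

def peak_correlation_response(image, template):
    """Return the peak magnitude of the 2-D circular cross-correlation."""
    # Correlation theorem: multiply one spectrum by the conjugate of the other.
    spectrum = np.fft.fft2(image) * np.conj(np.fft.fft2(template))
    return np.abs(np.fft.ifft2(spectrum)).max()

rng = np.random.default_rng(0)
template = rng.standard_normal((32, 32))
template /= np.linalg.norm(template)       # unit energy: a perfect match scores 1.0

# Same "object", circularly translated: the peak moves but keeps its height.
shifted = np.roll(template, (5, -3), axis=(0, 1))

# Unrelated unit-energy image (stand-in for a different target).
other = rng.standard_normal((32, 32))
other /= np.linalg.norm(other)

peak_match = peak_correlation_response(shifted, template)
peak_other = peak_correlation_response(other, template)
print(peak_match, peak_other)
```

With real imagery the interesting case lies between these two extremes: a distorted view of the same vehicle produces a peak somewhat below the matched value, which is the degradation plotted in figure 2.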


[Plot: matched spatial filter; peak response versus aspect angle.]

Figure 3. MSF peak output response of testing vehicles 1b and 2a over all aspect
angles. Responses are overlaid on training vehicle response. Filter
responses to vehicles 1b (dashed line) and 2a (dashed-dot) do not differ
significantly.



The output image plane response to a single image of vehicle 1a is shown in figure 4.
Refinements  to  the  distortion  invariant  filter  approach,  namely  the  MACE  filter,  will  show 
that  the  localization  of  this  output  response,  as  measured  by  the  sharpness  of  the  peak,  can 
be  improved  significantly. 


Figure  4.    MSF  output  image  plane  response. 


2.1.1 Synthetic Discriminant Function

The degradation evidenced in figures 2 and 3 was the primary motivation for the
synthetic discriminant function (SDF) [Hester and Casasent, 1980]. A shortcoming of the
MSF, from the standpoint of distortion invariant filtering, is that it is only optimum for a
single image. One approach would be to design a bank of MSFs operating in parallel
which are matched to the distortion range. The typical ATR system, however, must
recognize/discriminate multiple vehicle types, and so from an implementation standpoint
alone parallel MSFs are an impractical choice. Hester and Casasent set out to design a
single filter which could be matched to multiple images using the idea of superposition. This
approach was possible due to the large number of coefficients (degrees of freedom) that
typically constitute 2-D image templates. For historical reasons, specifically that the filters
in question were synthesized optically using holographic techniques [Vander Lugt, 1964],
it was hypothesized that such a filter could be synthesized from linear combinations of a
set of exemplar images.

The filter synthesis procedure consists of projecting the exemplar images onto an
ortho-normal basis (originally Gram-Schmidt orthogonalization was used to generate the
basis). The next step is to determine the coefficients with which to linearly combine the
basis vectors such that a desired response for each original image exemplar is obtained
[Hester and Casasent, 1980].

The proposed synthesis procedure is a bit convoluted. It turns out that the choice of
ortho-normal basis is irrelevant. As long as the basis spans the space of the original
exemplar images the result is always the same. The development of Kumar [1986] is more
useful for depicting the SDF as a generalization of the matched filter (for the white noise
case) to multiple signals. The SDF can be cast as the solution to the following
optimization problem

$$ \min_h \; h^\dagger h \quad \text{s.t.} \quad X^\dagger h = d, \qquad \{ h \in \mathbb{C}^{N \times 1},\; X \in \mathbb{C}^{N \times N_t},\; d \in \mathbb{C}^{N_t \times 1} \} $$

where $X$ is now a matrix whose $N_t$ columns comprise a set of training images¹ we wish
to detect, and $d$ is a column vector of desired outputs (one for each of the training
exemplars) and is typically set to all unity values for the recognition class. The images of
the data matrix $X$ comprise the range of distortion that the implemented filter is expected
to encounter. It is assumed that $N_t < N$, and so the problem formulation is a quadratic
optimization subject to an under-determined system of linear constraints. The optimal
solution is

$$ h = X (X^\dagger X)^{-1} d. $$

1. Since these filters have been applied primarily to 2-D images, signals will be referred to
as images or exemplars from this point on. In the vector notation, all $N_1 \times N_2$ images are
re-ordered (by row or column) into $N \times 1$ column vectors, where $N = N_1 N_2$.

When there is only one training exemplar ($N_t = 1$) and $d$ is unity the SDF defaults to
the normalized matched filter. Similar to the matched filter (white noise case), the SDF is
the linear filter which minimizes the white noise response while satisfying the set of linear
constraints over the training exemplars.
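
The closed-form SDF solution can be checked with a few lines of linear algebra. This is a sketch under stated assumptions: the exemplars are random real vectors rather than images, and the helper name `sdf_filter` is ours. It verifies that the constraints are met, that the solution is the minimum-norm one, and that a single exemplar with unit desired output gives the normalized matched filter.

```python
import numpy as np

def sdf_filter(X, d):
    """h = X (X^H X)^{-1} d: the minimum-norm h satisfying X^H h = d."""
    gram = X.conj().T @ X                  # Nt x Nt, small compared with N x N
    return X @ np.linalg.solve(gram, d)

rng = np.random.default_rng(0)
N, Nt = 64, 5                              # hypothetical sizes, Nt < N
X = rng.standard_normal((N, Nt))           # columns are vectorized exemplars
d = np.ones(Nt)                            # unity desired outputs

h = sdf_filter(X, d)
constraint_outputs = X.conj().T @ h        # should reproduce d

# Minimum-norm solution of the under-determined system, for comparison.
h_min_norm = np.linalg.pinv(X.conj().T) @ d

# Single-exemplar case: the SDF reduces to the normalized matched filter.
x = X[:, :1]
h_single = sdf_filter(x, np.ones(1))
matched = x[:, 0] / (x[:, 0] @ x[:, 0])
```

Only the $N_t \times N_t$ Gram matrix is inverted, which is why the SDF remains cheap even for large images.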

By way of example, the SDF technique is tested against the ISAR data as in the MSF
case. Exemplar images from vehicle 1a were selected at every 4 degrees aspect from 5 to
85 degrees for a total of 21 exemplar images (i.e. $N_t = 21$). Figure 5 shows the peak
output response over all aspects of the training vehicle (1a). As seen in the figure, the
degradation as the aspect changes is removed. The MSF response has been overlaid to
highlight the differences.

The peak output response over all exemplars in the testing set is shown in figure 6.
From the perspective of peak response, the filter generalizes fairly well. However, as in the
MSF, the usefulness of the filter as a discriminant between vehicles 1 and 2 is clearly
limited.

Figure 7 shows the resulting output plane response when the SDF filter is correlated
with a single image of vehicle 1a. The localization of the peak is similar to the MSF case.


[Plot: synthetic discriminant function; peak response versus aspect angle.]

Figure 5. SDF peak output response of training vehicle 1a over all aspect angles.
The MSF response is also shown (dashed line). The degradation in the
peak response has been corrected.

2.1.2  Minimum  Variance  Synthetic  Discriminant  Function 

The SDF approach seemingly solved the problem of generalizing a matched filter to
multiple images. However, the SDF has no built-in noise tolerance by design (except for
the white noise case). Furthermore, in practice, it turned out that the noise response would
occasionally be higher than the peak object response, depending on the type of imagery.
As a result, detection by means of searching for correlation peaks was shown to be
unreliable for some types of imagery, specifically imagery which contains recognition
class images embedded in non-white noise [Kumar, 1992]. Kumar [1986] proposed a
method by which noise tolerance could be built in to the filter design. This technique was
termed the minimum variance synthetic discriminant function (MVSDF). The MVSDF is
the correlation filter which minimizes the output variance due to zero-mean input noise
while satisfying the same linear constraints as the SDF. The output noise variance can be
shown to be $h^\dagger \Sigma_n h$, where $h$ is the vector of filter coefficients and $\Sigma_n$ is the
covariance matrix of the noise [Kumar, 1986].

[Plot: synthetic discriminant function; peak response versus aspect angle.]

Figure 6. SDF peak output response of testing vehicles 1b and 2a over all aspect
angles. The dashed line is vehicle 1b while the dashed-dot line is
vehicle 2a.

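Figure 7. SDF output image plane response.

Mathematically the problem formulation is

$$ \min_h \; h^\dagger \Sigma_n h \quad \text{s.t.} \quad X^\dagger h = d, \qquad h \in \mathbb{C}^{N \times 1},\; X \in \mathbb{C}^{N \times N_t},\; \Sigma_n \in \mathbb{C}^{N \times N},\; d \in \mathbb{C}^{N_t \times 1}, $$

with the optimal solution

$$ h = \Sigma_n^{-1} X (X^\dagger \Sigma_n^{-1} X)^{-1} d. $$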

In the case of white noise, the MVSDF is equivalent to the SDF. This technique has a
significant numerical complexity issue: the solution requires the inversion of
an $N \times N$ matrix ($\Sigma_n$), which for moderate image sizes ($N = N_1 N_2$) can be quite large
and computationally prohibitive unless simplifying assumptions can be made about its
form (e.g. a diagonal matrix, Toeplitz, etc.).
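
The MVSDF solution and its relation to the SDF can be illustrated on a small problem. The sketch below assumes real-valued random exemplars and a synthetic positive-definite covariance; `mvsdf_filter` and the covariance construction are hypothetical and chosen only to keep the example short.

```python
import numpy as np

def mvsdf_filter(X, d, sigma):
    """h = Sigma^{-1} X (X^H Sigma^{-1} X)^{-1} d."""
    sigma_inv_X = np.linalg.solve(sigma, X)      # avoids forming Sigma^{-1} explicitly
    return sigma_inv_X @ np.linalg.solve(X.conj().T @ sigma_inv_X, d)

rng = np.random.default_rng(1)
N, Nt = 48, 4
X = rng.standard_normal((N, Nt))
d = np.ones(Nt)

A = rng.standard_normal((N, N))
sigma = A @ A.T / N + 0.1 * np.eye(N)            # hypothetical colored-noise covariance

h_mvsdf = mvsdf_filter(X, d, sigma)
h_white = mvsdf_filter(X, d, np.eye(N))          # white noise: reduces to the SDF
h_sdf = X @ np.linalg.solve(X.T @ X, d)

# Output noise variance h^H Sigma h under the colored noise model.
var_mvsdf = h_mvsdf @ sigma @ h_mvsdf
var_sdf = h_sdf @ sigma @ h_sdf
```

Both filters satisfy the same constraints, so the MVSDF variance can never exceed the SDF variance under the assumed noise model. The cost is the $N \times N$ solve, which is the complexity issue noted above.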

The MVSDF can be seen as a more general extension of the matched filter to multiple
vector detection as most signal processing definitions of the matched filter incorporate a
noise power spectrum and do not assume the white noise case only. It is mentioned here
because it is the first distortion invariant filtering technique to recognize the need to
characterize a rejection class.

2.1.3  Minimum  Average  Correlation  Energy  Filter 

The MVSDF (and the SDF) control the output of the filter at a single point in the
output plane of the filter. In practice large sidelobes may be exhibited in the output plane,
making detection difficult. These difficulties led Mahalanobis et al. [1987] to propose the
minimum average correlation energy (MACE) filter. The design goal of this development
in distortion invariant filtering is to control not only the output point when the image
is centered on the filter, but the response of the entire output plane as well. Specifically, it
minimizes the average correlation energy of the output over the training exemplars subject
to the same linear constraints as the MVSDF and SDF filters.

The problem is formulated in the frequency domain using Parseval relationships. In
the frequency domain, the formulation is

$$ \min_H \; H^\dagger D H \quad \text{s.t.} \quad X^\dagger H = d, $$

where $D$ is a diagonal matrix whose diagonal elements are the coefficients of the average
2-D power spectrum of the training exemplars. The form of the quadratic criterion is
derived using Parseval's relationship. A derivation is given in section A.1 of the appendix.
The other terms, $H$ and $X$, contain the 2-D DFT coefficients of the filter and training
exemplars, respectively. The vector $d$ is the same as in the MVSDF and SDF cases. The
optimal solution, in the frequency domain, is

$$ H = D^{-1} X (X^\dagger D^{-1} X)^{-1} d. \tag{1} $$

As in the MVSDF, the solution requires the inversion of an $N \times N$ matrix, but in this
case the matrix $D$ is diagonal and so its inversion is trivial. When the noise covariance
matrix is estimated from observations of noise sequences (assuming wide-sense
stationarity and ergodicity), the MVSDF can also be formulated in the frequency domain,
and the complex matrix inversion is avoided. A derivation of this is given in appendix A.
Examination of equations (95), (96), and (97) shows that, under the assumption that the
noise class can be modeled as a stationary, ergodic random noise process, the solution of
the MVSDF can be found in the spectral domain using the estimated power spectrum of
the noise process and equation (1).
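
Equation (1) can be sketched in a few lines. The example below is illustrative only: it assumes a unitary 2-D DFT (so that the frequency-domain constraint equals the space-domain inner product), uses random arrays in place of ISAR exemplars, and the name `mace_filter` is ours. It checks the constraints and confirms that the MACE filter has no more average correlation energy than an SDF built from the same exemplars.

```python
import numpy as np

def mace_filter(images, d):
    """Frequency-domain MACE filter H = D^{-1} X (X^H D^{-1} X)^{-1} d."""
    # Columns of X hold the unitary 2-D DFT of each exemplar, so that
    # X^H H equals the space-domain inner product (Parseval).
    X = np.stack([np.fft.fft2(im, norm="ortho").ravel() for im in images], axis=1)
    D = np.mean(np.abs(X) ** 2, axis=1)          # diagonal of D: average power spectrum
    Dinv_X = X / D[:, None]                      # D^{-1} X without forming D
    H = Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, d)
    return H, X, D

rng = np.random.default_rng(2)
shape, Nt = (16, 16), 3
images = [rng.standard_normal(shape) for _ in range(Nt)]
d = np.ones(Nt)

H, X, D = mace_filter(images, d)
H_sdf = X @ np.linalg.solve(X.conj().T @ X, d)   # SDF meeting the same constraints

ace_mace = np.real(H.conj() @ (D * H))           # criterion H^H D H
ace_sdf = np.real(H_sdf.conj() @ (D * H_sdf))

# Output plane for the first exemplar; h is the space-domain filter.
h = np.fft.ifft2(H.reshape(shape), norm="ortho")
plane = np.fft.ifft2(np.conj(np.fft.fft2(images[0])) * np.fft.fft2(h))
```

The value `plane[0, 0]` is the constrained output for the first exemplar, while the remainder of `plane` carries the energy that the criterion suppresses. The scaling of `D` does not affect the filter, since it cancels in equation (1).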

In practice, the MACE filter performs better than the MVSDF with respect to rejecting
out-of-class input images. The MACE filter, however, has been shown to have poor
generalization properties; that is, images in the recognition class but not in the training
exemplar set are not recognized.

A MACE filter was computed using the same exemplar images as in the SDF example.
Figure 8 shows the resulting output image plane response for one image. As can be seen in
the figure, the peak in the center is now highly localized. In fact it can be shown
[Mahalanobis et al., 1987] that over the training exemplars (those used to compute the
filter) the output peak will always be at the constraint location.

Generalization to between-aspect images, as mentioned, is a problem for the MACE
filter. Figure 9 shows the peak output response over all aspect angles. As can be seen in the
figure, the peak response degrades severely for aspects between the exemplars used to
compute the filter. Furthermore, from a peak output response viewpoint, generalization to
vehicle 1b is also worse. However, unlike the previous techniques, we now begin to see
some separation between the two vehicle types as represented by their peak response.


Figure 8. MACE filter output image plane response.

2.1.4 Optimal Trade-off Synthetic Discriminant Function

The final distortion invariant filtering technique which will be discussed here is the
method proposed by Réfrégier and Figue [1991], known as the optimal trade-off
synthetic discriminant function (OTSDF). Suppose that the designer wishes to optimize over
multiple quadratic optimization criteria (e.g. average correlation energy and output noise
variance) subject to the same set of equality constraints as in the previous distortion
invariant filters. We can represent the individual optimization criteria by

$$ J_i = h^\dagger Q_i h, $$

where $Q_i$ is an $N \times N$ symmetric, positive-definite matrix (e.g. $Q_i = \Sigma_n$ for the MVSDF
optimization criterion).

The OTSDF is a method by which a set of quadratic optimization criteria may be
optimally traded off against each other; that is, one criterion can be minimized with
minimum penalty to the rest. The solution to all such filters can be characterized by the
equation

$$ h = Q^{-1} X (X^\dagger Q^{-1} X)^{-1} d, \tag{2} $$

where, assuming $M$ different criteria,

$$ Q = \sum_{i=1}^{M} \lambda_i Q_i, \qquad \sum_{i=1}^{M} \lambda_i = 1. $$

[Plot: MACE filter; peak response versus aspect angle.]

Figure 9. MACE peak output response of vehicles 1a, 1b and 2a over all aspect
angles. Degradation to between-aspect exemplars is evident.
Generalization to the testing vehicles as measured by peak output
response is also poorer. Vehicle 1a is the solid line, 1b is the dashed line
and 2a is the dashed-dot line.


The possible solutions, parameterized by $\lambda_i$, define a performance bound which
cannot be exceeded by any linear system with respect to the optimization criteria and the
equality constraints. All such linear filters which optimally trade off a set of quadratic
criteria are referred to as optimal trade-off synthetic discriminant functions.

We may, for example, wish to trade off the MACE filter criterion versus the MVSDF
filter criterion. This presents the added difficulty that one criterion is specified in the space
domain and the other in the spectral domain. If the noise is represented as zero-mean,
stationary, and ergodic (if the covariance is to be estimated from samples) we can, as
mentioned, transform the MVSDF criterion to the spectral domain. In this case the optimal
filter has the frequency domain solution,

$$ H = [\lambda D_x + (1-\lambda) D_n]^{-1} X \left[ X^\dagger [\lambda D_x + (1-\lambda) D_n]^{-1} X \right]^{-1} d = D_\lambda^{-1} X (X^\dagger D_\lambda^{-1} X)^{-1} d, $$

where $D_\lambda = \lambda D_x + (1-\lambda) D_n$, $0 \le \lambda \le 1$, and $D_n$, $D_x$ are diagonal matrices whose
diagonal elements contain the estimated power spectrum coefficients of the noise class and
the recognition class, respectively. The performance bound of such a filter would resemble
figure 10, where all linear filters would fall in the darkened region and all optimal trade-off
filters would lie somewhere on the boundary.
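
The two-criterion trade-off can be explored numerically. In the sketch below, which is illustrative only, we assume that $\lambda$ weights the recognition-class spectrum $D_x$ (so that $\lambda = 1$ gives the MACE filter and $\lambda = 0$ gives the white-noise MVSDF, i.e. the SDF); this convention, the helper names, and the random stand-in exemplars are all assumptions.

```python
import numpy as np

def constrained_filter(X, d, diag):
    """H = D^{-1} X (X^H D^{-1} X)^{-1} d for a diagonal D given as a vector."""
    Dinv_X = X / diag[:, None]
    return Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, d)

rng = np.random.default_rng(3)
shape, Nt = (16, 16), 3
X = np.stack([np.fft.fft2(rng.standard_normal(shape), norm="ortho").ravel()
              for _ in range(Nt)], axis=1)
d = np.ones(Nt)

D_x = np.mean(np.abs(X) ** 2, axis=1)   # recognition-class power spectrum (ACE criterion)
D_n = np.ones_like(D_x)                 # flat spectrum: white-noise variance criterion

def otsdf(lam):
    """Optimal trade-off filter for the assumed weighting lam*D_x + (1-lam)*D_n."""
    return constrained_filter(X, d, lam * D_x + (1.0 - lam) * D_n)

def quad(H, diag):
    """Quadratic criterion H^H D H for diagonal D."""
    return np.real(H.conj() @ (diag * H))

lams = (0.0, 0.95, 1.0)
ace = {lam: quad(otsdf(lam), D_x) for lam in lams}      # average correlation energy
noise = {lam: quad(otsdf(lam), D_n) for lam in lams}    # output noise variance
```

Sweeping `lam` over $[0, 1]$ and plotting `noise` against `ace` traces a boundary curve of the kind described above: as `lam` increases, correlation energy falls while noise variance rises.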

By way of example we again use the data from the MACE and SDF examples. In this
case we will construct an OTSDF which trades off the MACE filter criterion for the SDF
criterion. In order to transform the SDF to the spectral domain, we will assume that the
noise class is zero-mean, stationary, white noise. The power spectrum is therefore flat. One
of the issues for constructing an OTSDF is how to set the value of $\lambda$, which represents the
degree by which one criterion is emphasized over another. We will not address that issue
here, but simply set the value to $\lambda = 0.95$, indicating more emphasis on the MACE filter
criterion.

[Plot: OTSDF performance bound; axis label: average correlation energy; end points labeled MVSDF and MACE.]

Figure 10. Example of a typical OTSDF performance plot. This plot shows the
trade-off, hypothetically, between the ACE criterion and a noise
variance criterion. The curved arrow on the performance bound indicates
the direction of increasing $\lambda$ for the two-criterion case. The curve is
bounded by the MACE and MVSDF results.


The output plane response of the OTSDF is shown in figure 11. As compared to the
MACE filter response, the output peak is not nearly as sharp, but still more localized than
in the SDF case.

The peak output response over the training vehicle for the OTSDF is compared to the
MACE filter in figure 12. The degradation to between-aspect exemplars is less severe than
for the MACE filter. The peak output responses of vehicles 1b and 2a are shown in figure 13.
As compared to the MACE filter, the peak response is improved over the testing set.
Separation between the two vehicle types appears to be maintained.

Figure 11. OTSDF filter output image plane response.


2.2 Pre-processor/SDF Decomposition

In the sample domain, the SDF family of correlation filters is equivalent to a cascade
of a linear pre-processor followed by a linear correlator [Mahalanobis et al., 1987; Kumar,
1992]. This is illustrated in figure 14 with vector operations. The pre-processor, in the case
of the MACE filter, is a pre-whitening filter computed on the basis of the average power
spectrum of the recognition class training exemplars. In the case of the MVSDF the
pre-processor is a pre-whitening filter computed on the basis of the covariance matrix of the
noise. The net result is that after pre-processing, the second processor is an SDF computed
over the pre-processed exemplars.
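
The decomposition can be verified numerically for the MACE filter. The following sketch works in the frequency domain, where the pre-whitening transformation is diagonal, $A = D^{-1/2}$; the random exemplars and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
shape, Nt = (16, 16), 3
X = np.stack([np.fft.fft2(rng.standard_normal(shape), norm="ortho").ravel()
              for _ in range(Nt)], axis=1)
d = np.ones(Nt)
D = np.mean(np.abs(X) ** 2, axis=1)              # average power spectrum (diagonal of D)

# Direct MACE solution, equation (1).
Dinv_X = X / D[:, None]
H_mace = Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, d)

# Cascade: pre-whiten with A = D^{-1/2}, then compute an SDF over the
# pre-processed exemplars Y = A X.
A = 1.0 / np.sqrt(D)
Y = A[:, None] * X
h_sdf = Y @ np.linalg.solve(Y.conj().T @ Y, d)

# Response of both structures to a new (hypothetical) input image.
x_in = np.fft.fft2(rng.standard_normal(shape), norm="ortho").ravel()
out_direct = x_in.conj() @ H_mace                # single linear filter
out_cascade = (A * x_in).conj() @ h_sdf          # pre-processor followed by SDF
```

The two outputs agree because $A\,h_{\mathrm{SDF}} = H_{\mathrm{MACE}}$; the pre-processor carries all of the dependence on $D$.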


[Plot: OTSDF; peak response versus aspect angle.]

Figure 12. OTSDF peak output response of vehicle 1a over all aspect angles.
Degradation to between-aspect exemplars is less than in the MACE
filter, shown in dashed line.

The primary contribution of this research will be to extend the ideas of MACE filtering
to a general nonlinear signal processing architecture and accompanying classification
framework. These extensions will focus on processing structures which improve the
generalization and discrimination properties while maintaining the shift-invariance and
localization detection properties of the linear MACE filter.


[Plot: OTSDF; peak response versus aspect angle.]

Figure 13. OTSDF peak output response of vehicles 1b and 2a over all aspect
angles. Generalization is better than in the MACE filter. Vehicle 1b is
shown in dashed line, vehicle 2a is shown in dashed-dot line.


[Diagram: filter decomposition; input image x, pre-processor (y = Ax), SDF, scalar output.]

Figure 14. Decomposition of distortion invariant filter in space domain. The
notation used assumes that the image and filter coefficients have been
re-ordered into vectors. The input image vector, $x$, is pre-processed
by the linear transformation, $y = Ax$. The resulting vector is
processed by a synthetic discriminant function, $y_{\mathrm{out}} = y^\dagger h$.


CHAPTER  3 
THE  MACE  FILTER  AS  AN  ASSOCIATIVE  MEMORY 
3.1 Linear Systems as Classifiers

In this chapter we present the MACE filter from the perspective of associative
memories. This perspective is important because it leads to a machine-learning and
classification framework and consequently a means by which to determine the parameters of a
nonlinear mapping via gradient search techniques. We shall refer, herein, to the machine
learning/gradient search methods as an iterative framework. The techniques are iterative in
the sense that adaptations to the mapping parameters are computed sequentially and
repeatedly over a set of exemplars. We shall show that the iterative and classification
framework combined with a nonlinear system architecture has distinct advantages over the
linear framework of distortion invariant filters.

As we have stated, distortion invariant filters can only realize linear discriminant
functions. We begin, therefore, by considering linear systems used as classifiers. The adaline
architecture [Widrow and Hoff, 1960], depicted in figure 15, is an example of a linear
system used for pattern classification. A pattern, represented by the coefficients $x_i$, is
applied to a linear combiner, represented by the weight coefficients $w_i$; the resulting
output $y$ is then applied to a hard limiter which assigns a class to the input pattern.
Mathematically this can be represented by

$$ c = \mathrm{sgn}(y - \varphi) = \mathrm{sgn}(w^T x - \varphi), $$

where $\mathrm{sgn}(\cdot)$ is the signum function, $\varphi$ is a threshold, and $x, w \in \mathbb{R}^{N \times 1}$ are column
vectors containing the coefficients of the pattern and combiner weights, respectively. In
the context of classification, this architecture is trained iteratively using the least mean
square (LMS) algorithm [Widrow and Hoff, 1960]. For a two class problem the desired
output, $d$ in the figure, is set to $\pm 1$ depending on the class of the input pattern; the LMS
algorithm then minimizes the mean square error (MSE) between the classification output
$c$ and the desired output. Since the error function, $e_c$, can only take on three values, $\pm 2$
and 0, minimization of the MSE is equivalent to minimizing the average number of actual
errors.


Figure  15.  Adaline  architecture 


There are several observations to be made about the adaline/LMS approach to
classification. One observation is that the adaptation process described uses the error, $e$, as
measured at the output of the linear combiner to drive the adaptation process and not the
actual classification error, $e_c$. Another observation is that this approach presupposes that
the pattern classes can be linearly separated. A final point, on which we will have more to
say, is that the method uses the MSE criterion as a proxy for classification.
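
A small adaline/LMS example makes the first observation concrete: the weights are adapted from the error at the combiner output, and the hard limiter is applied only when a class is assigned. This is a minimal sketch on synthetic, linearly separable data; the cluster locations, step size, and epoch count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_per_class = 100
# Two linearly separable clusters (hypothetical data); desired outputs are +/-1.
x_pos = rng.normal(loc=+2.0, scale=0.5, size=(n_per_class, 2))
x_neg = rng.normal(loc=-2.0, scale=0.5, size=(n_per_class, 2))
patterns = np.vstack([x_pos, x_neg])
desired = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

w = np.zeros(2)      # combiner weights
phi = 0.0            # threshold
mu = 0.01            # LMS step size

for epoch in range(20):
    for i in rng.permutation(len(patterns)):
        y = w @ patterns[i] - phi          # linear combiner output (before the limiter)
        err = desired[i] - y               # error measured at the combiner output
        w += mu * err * patterns[i]        # LMS weight update
        phi -= mu * err                    # threshold acts as a weight on a constant -1 input

c = np.sign(patterns @ w - phi)            # hard limiter assigns the class
accuracy = np.mean(c == desired)
print(w, phi, accuracy)
```

On data that are not linearly separable the same loop still converges toward the least-squares weights, but the resulting classification error is no longer guaranteed to be minimal, which motivates the discussion that follows.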

3.2 MSE Criterion as a Proxy for Classification Performance

As we have pointed out, the adaline/LMS approach to classification uses the MSE
criterion to drive the adaptation process. It is the probability of misclassification (also called
the Bayes criterion), however, with which we are truly concerned. We now discuss the
consequence of using the MSE criterion as a proxy for classification performance.

It is well known that the discriminant function that minimizes misclassification is
monotonically related to the posterior probability distribution of the class, $c$, given the
observation $x$ [Fukunaga, 1990]. That is, for the two class problem, if the discriminant
function is

$$ f(x) = P_2\, p(C_2|x), \tag{3} $$

where $P_2$ is the prior probability of class 2, and $p(C_2|x)$ is the conditional probability
distribution of class 2 given $x$, then the probability of misclassification will be minimized if
the following decision rule is used

$$ \begin{cases} f(x) < 0.5 & \text{choose class 1} \\ f(x) > 0.5 & \text{choose class 2} \end{cases} \tag{4} $$

For the case of $f(x) = 0.5$, both classes are equally likely, so a guess must be made.

3.2.1 Unrestricted Functional Mappings

With regard to the adaline/LMS approach we now ask, what is the consequence of
using the MSE criterion for computing discriminant functions? In the two class case, the
source distributions are $p(x|C_1)$ or $p(x|C_2)$ depending on whether the observation, $x$, is
drawn from class 1 or class 2, respectively. If we assign a desired output of zero to class 1
and unity to class 2 then the MSE criterion is equivalent to the following

$$ J(f) = \frac{P_1}{2} E\{ f(x)^2 \,|\, C_1 \} + \frac{P_2}{2} E\{ (1 - f(x))^2 \,|\, C_2 \}, \tag{5} $$

where the 1/2 scale factors are for convenience, $E\{\,\}$ is the expectation operator, and
$C_i$ indicates class $i$.

For now we will place no constraints on the functional form of $f(x)$. In so doing, we
can solve for the optimal solution using the calculus of variations approach. In this case,
we would like to find a stationary point of the criterion $J(f)$ due to small perturbations in
the function $f(x)$ indicated by


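$$ \delta J = J(f + \delta f) - J(f) = 0. \tag{6} $$

The first term of equation 6 can be computed as

$$ \begin{aligned} J(f + \delta f) &= \frac{P_1}{2} E\{ (f + \delta f)^2 \,|\, C_1 \} + \frac{P_2}{2} E\{ (1 - f - \delta f)^2 \,|\, C_2 \} \\ &= \frac{P_1}{2} E\{ (f^2 + 2 f \delta f) \,|\, C_1 \} + \frac{P_2}{2} E\{ ((1 - f)^2 - 2 (1 - f) \delta f) \,|\, C_2 \} + O(\delta f^2) \\ &= J(f) + \frac{P_1}{2} E\{ (2 f \delta f) \,|\, C_1 \} - \frac{P_2}{2} E\{ (2 (1 - f) \delta f) \,|\, C_2 \} + O(\delta f^2), \end{aligned} \tag{7} $$

which can be substituted into equation 6 to yield

$$ \begin{aligned} \delta J &= P_1 E\{ f \delta f \,|\, C_1 \} - P_2 E\{ (1 - f) \delta f \,|\, C_2 \} \\ &= P_1 \int f(x)\, \delta f(x)\, p(x|C_1)\, dx - P_2 \int (1 - f(x))\, \delta f(x)\, p(x|C_2)\, dx \\ &= \int \left( f(x)\, p_X(x) - P_2\, p(x|C_2) \right) \delta f(x)\, dx, \end{aligned} \tag{8} $$

where $p_X(x) = P_1 p(x|C_1) + P_2 p(x|C_2)$ is the unconditional probability distribution of
the random variable $X$. In order for $f(x)$ to be a stationary point of $J(f)$, equation 8
must be zero over all $x$ for any arbitrary perturbation $\delta f(x)$. Consequently

$$ f(x)\, p_X(x) - P_2\, p(x|C_2) = 0 \tag{9} $$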

or

$$ f(x) = \frac{P_2\, p(x|C_2)}{p_X(x)} = \frac{P_2\, p(x|C_2)}{P_1\, p(x|C_1) + P_2\, p(x|C_2)} = p(C_2|x), \tag{10} $$

which is the posterior probability that the observation is drawn from class 2. If we had
reversed the desired outputs, the result would have been the posterior probability that the
observation was drawn from class 1. This result, predicated on our choice of desired
outputs, shows that for arbitrary $f(x)$, the MSE criterion is equivalent to the probability of
misclassification criterion. In fact, it has been shown by Richard and Lippmann [1991]
(using other means) for the multi-class case that if the desired outputs are encoded as
vectors, $e_i \in \mathbb{R}^{N \times 1}$, where the $i$th element is unity and the others are zero, for an
$N$-class problem the MSE criterion is equivalent to optimizing the Bayes criterion for
classification.
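
The stationarity condition of equation 9 can be checked on a toy problem in which $x$ takes only four discrete values, so that an "unrestricted" $f$ is simply one free number per value of $x$. The priors and class-conditional probabilities below are hypothetical; gradient descent on the criterion of equation 5 is used only to show that the minimizer coincides with the posterior of equation 10.

```python
import numpy as np

P1, P2 = 0.6, 0.4                          # class priors (hypothetical)
p_x_c1 = np.array([0.4, 0.3, 0.2, 0.1])    # p(x|C1) over four discrete values of x
p_x_c2 = np.array([0.1, 0.2, 0.3, 0.4])    # p(x|C2)
p_x = P1 * p_x_c1 + P2 * p_x_c2            # unconditional distribution p_X(x)

def mse_criterion(f):
    """J(f) = P1/2 E{f^2 | C1} + P2/2 E{(1 - f)^2 | C2} for discrete x."""
    return 0.5 * P1 * np.sum(p_x_c1 * f**2) + 0.5 * P2 * np.sum(p_x_c2 * (1 - f)**2)

f = np.zeros(4)                            # unrestricted: one free value per x
for _ in range(500):
    grad = f * p_x - P2 * p_x_c2           # left-hand side of equation 9
    f -= grad                              # unit-step gradient descent

posterior = P2 * p_x_c2 / p_x              # equation 10
print(f, posterior)
```

At the third value of $x$ the posterior is exactly 0.5, the boundary case of the decision rule where a guess must be made.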

3.2.2  Parameterized  Functional  Mappings 

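Suppose, however, that the function is not arbitrary, but is also a function of a
parameter set, $\alpha$, as in $f(x, \alpha)$. The MSE criterion of equation 5 can be rewritten

$$ J(f) = \frac{P_1}{2} E\{ f(x, \alpha)^2 \,|\, C_1 \} + \frac{P_2}{2} E\{ (1 - f(x, \alpha))^2 \,|\, C_2 \}. \tag{11} $$

The gradient of the criterion with respect to the parameters becomes

$$ \frac{\partial J}{\partial \alpha} = P_1 E\left\{ f(x, \alpha) \frac{\partial}{\partial \alpha} f(x, \alpha) \,\Big|\, C_1 \right\} - P_2 E\left\{ (1 - f(x, \alpha)) \frac{\partial}{\partial \alpha} f(x, \alpha) \,\Big|\, C_2 \right\}, \tag{12} $$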

and consequently

$$ \begin{aligned} \frac{\partial J}{\partial \alpha} &= P_1 \int f(x, \alpha) \frac{\partial}{\partial \alpha} f(x, \alpha)\, p(x|C_1)\, dx - P_2 \int (1 - f(x, \alpha)) \frac{\partial}{\partial \alpha} f(x, \alpha)\, p(x|C_2)\, dx \\ &= \int \left( f(x, \alpha) \left( P_1\, p(x|C_1) + P_2\, p(x|C_2) \right) - P_2\, p(x|C_2) \right) \frac{\partial}{\partial \alpha} f(x, \alpha)\, dx. \end{aligned} \tag{13} $$

Examination of equation 13 allows for two possibilities for a stationary point of the
criterion. The first, as before, is that

$$ f(x, \alpha) = \frac{P_2\, p(x|C_2)}{p_X(x)} = p(C_2|x), \tag{14} $$

while the second is that we are at a local minimum with respect to $\alpha$. In other words, if the
parameterized function can realize the Bayes discriminant function via an appropriate
choice of its parameters, then this function represents a global minimum, but this does not
discount the fact that there may be local minima. Furthermore, if the parameterized
function is not capable of representing the Bayes discriminant function there is no guarantee
that the global (or local) minima will result in robust classification.

3.2.3  Finite  Data  Sets 

The  previous  development  does  not  take  into  account  that  in  an  iterative  framework  we 
are  working  with  observations  of  a  random  variable.  Therefore,  we  rewrite  the  criterion  of 
equation  5  as  finite  summations.  That  is.  the  criterion  becomes 

•/(/{a:,  a))  =  f^  X  /(^ra)^*-^  Z  (l-/K-.a))2,  (15) 

where  Xj  e  C-  denotes  the  set  of  observations  taken  from  class  C, .  Taking  the  derivative 
of  this  criterion  with  respect  to  the  parameters,  a ,  yields 

1^  =  Pi  Z  /K.  «)4/U,.  a)-p2  Z  (l-/(A:,,a))A/(^..  a).  (16) 

Xj  e  C,  xi  e  Cj 

It  is  assumed  that  the  set  of  observations  from  class  C,  (jr,  s  C, )  are  independent  and 
identically  distributed  (i.i.d.),  as  are  the  set  of  observations  from  class  Cj  (jr,  e  Cj) 
although  with  a  different  distribution  than  cla.ss  C,  .  Since  the  summation  terms  are  bro- 
ken up  by  class,  we  can  assume  that  the  arguments  of  the  summations  (functions  of  dis- 
tinct i.i.d.  random  variables)  are  themselves  i.i.d.  random  variables  [Papoulis,  1991].  If  we 
set  PiW,  =  P|  and  fi2^l  ~  ^2<  *here  P^  and  Pj  ^i"^  'he  prior  probabilities  of  classes 
C|  and  Cj,  respectively,  and  N^  and  Nj  are  the  number  of  samples  from  drawn  from 


35 
each  of  the  classes,  we  can  use  the  law  of  large  numbers  to  say  that  the  summations  of 
equation  16  approach  their  expected  values.  In  other  words,  in  the  limit  as  N^,  Wj  — >  "> 


^  =  P,j/ua)^4^|c,    -pJ(l-/(.,a))^to)|c,L  07, 


which  is  identical  to  equation  12  and  so  yields  the  same  solution  for  the  mapping  as 

fix,a)  =  ^^--\^.  (18) 

The conclusion is that if we have a sufficient number of observations to characterize the underlying distributions then the MSE criterion is again equivalent to the Bayes criterion.
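The finite-sample argument can be checked numerically. The sketch below is illustrative only and is not taken from the original experiments: it assumes two one-dimensional Gaussian classes, priors of 0.6 and 0.4, and a piecewise-constant mapping (one parameter per histogram bin), all of which are hypothetical choices. With $\beta_j = P_j / N_j$, the per-bin minimizer of the finite-sum criterion has a closed form, and it should approach the posterior probability of class $C_2$ as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy problem: two 1-D Gaussian classes with known priors.
P1, P2 = 0.6, 0.4
N = 200_000
N1, N2 = int(P1 * N), int(P2 * N)
x1 = rng.normal(-1.0, 1.0, N1)   # observations from class C1 (desired output 0)
x2 = rng.normal(+1.0, 1.0, N2)   # observations from class C2 (desired output 1)

# Mapping f(x, alpha): piecewise constant over bins, one parameter per bin.
edges = np.linspace(-2.0, 2.0, 21)
n1, _ = np.histogram(x1, bins=edges)
n2, _ = np.histogram(x2, bins=edges)

# With beta_j = P_j / N_j, setting the derivative of
# beta1 * sum_{C1} f^2 + beta2 * sum_{C2} (1 - f)^2 to zero in each bin gives:
beta1, beta2 = P1 / N1, P2 / N2
f_hat = beta2 * n2 / (beta1 * n1 + beta2 * n2)

# True posterior P(C2 | x) evaluated at the bin centers.
centers = 0.5 * (edges[:-1] + edges[1:])

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

posterior = P2 * gauss(centers, 1.0) / (
    P1 * gauss(centers, -1.0) + P2 * gauss(centers, 1.0)
)

max_err = np.max(np.abs(f_hat - posterior))
print(f"max |f_hat - posterior| = {max_err:.4f}")
```

The remaining discrepancy comes from the finite bin width and sampling noise; both shrink as more observations are used with narrower bins.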


3.3  Derivation  of  the  MACE  Filter 
We  have  already  introduced  the  MACE  filter  in  a  previous  section.  We  present  a  deri- 
vation of  the  MACE  filter  here.  The  development  is  similar  to  the  derivations  given  in 
Mahalanobis  [1987]  and  Kumar  [1992].  Our  purpose  in  this  presentation  of  the  derivation 
is  that  it  serves  to  illustrate  the  associative  memory  perspective  of  optimized  correlators;  a 
perspective  which  will  be  used  to  motivate  the  development  of  the  nonlinear  extensions 
presented  in  later  sections. 


In the original development, SDF type filters were formulated using correlation operations, a convention which will be maintained here. The output, $g(n_1, n_2)$, of a correlation filter is determined by

$$g(n_1,n_2) = \sum_{m_1=0}^{N_1-1}\sum_{m_2=0}^{N_2-1} x^*(n_1+m_1, n_2+m_2)\,h(m_1,m_2) = x^*(n_1,n_2) ** h(-n_1,-n_2),$$

where $x^*(n_1, n_2)$ is the complex conjugate of an input image with $N_1 \times N_2$ region of support, $h(n_1, n_2)$ represents the filter coefficients, and $**$ represents the two-dimensional circular convolution operation [Oppenheim and Shafer, 1989].

The MACE filter formulation is as follows [Mahalanobis et al., 1987]. Given a set of image exemplars, $\{x_i \in \mathbb{R}^{N_1 \times N_2};\ i = 1 \ldots N_t\}$, we wish to find filter coefficients, $h \in \mathbb{R}^{N_1 \times N_2}$, such that the average correlation energy at the output of the filter, defined as

$$E = \frac{1}{N_t}\sum_{i=1}^{N_t}\sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1} \left|g_i(n_1,n_2)\right|^2, \tag{19}$$

is minimized subject to the constraints

$$g_i(0,0) = \sum_{m_1=0}^{N_1-1}\sum_{m_2=0}^{N_2-1} x_i^*(m_1,m_2)\,h(m_1,m_2) = d_i, \quad i = 1 \ldots N_t. \tag{20}$$

Mahalanobis [1987] reformulates this as a vector optimization in the spectral domain using Parseval's theorem. In the spectral domain we wish to find the elements of $H \in \mathbb{C}^{N_1 N_2 \times 1}$, a column vector whose elements are the 2-D DFT coefficients of the space domain filter $h$ reordered lexicographically. Let the columns of the data matrix $X \in \mathbb{C}^{N_1 N_2 \times N_t}$ contain the 2-D DFT coefficients of the exemplars $\{x_1, \ldots, x_{N_t}\}$, also reordered into column vectors. The diagonal matrix $D_i \in \mathbb{R}^{N_1 N_2 \times N_1 N_2}$ contains the magnitude squared of the 2-D DFT coefficients of the $i$th exemplar. These matrices are averaged to form the diagonal matrix $D$ as

$$D = \frac{1}{N_t}\sum_{i=1}^{N_t} D_i, \tag{21}$$


which  then  contains  the  average  power  spectrum  of  the  training  exemplars.  Minimizing 
equation  (19)  subject  to  the  constraints  of  equation  (20)  is  equivalent  to  minimizing 

$$H^\dagger D H \tag{22}$$

subject  to  the  linear  constraints 

$$X^\dagger H = d, \tag{23}$$

where the elements of $d \in \mathbb{R}^{N_t \times 1}$ are the desired outputs corresponding to the exemplars.
The  solution  to  this  optimization  problem  can  be  found  using  the  method  of  Lagrange 
multipliers.  In  the  spectral  domain,  the  filter  that  satisfies  the  constraints  of  equation  (20) 
and minimizes the criterion of equation (19) [Mahalanobis et al., 1987; Kumar, 1992] is

$$H = D^{-1}X\left(X^\dagger D^{-1}X\right)^{-1}d, \tag{24}$$

where $H \in \mathbb{C}^{N_1 N_2 \times 1}$ contains the 2-D DFT coefficients of the filter, assuming a unitary 2-D DFT.
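As an illustration of equation (24), the following NumPy sketch computes a MACE filter for a handful of random stand-in images and checks the constraints in both domains. The image data, the sizes, and the helper name `mace_filter` are hypothetical; the only library behavior relied on is NumPy's `norm="ortho"` option, which gives a unitary DFT.

```python
import numpy as np

def mace_filter(images, d):
    """Spectral-domain MACE filter, equation (24), for real images of shape (Nt, N1, N2)."""
    Nt = images.shape[0]
    # Columns of X: unitary 2-D DFT of each exemplar, reordered into a vector.
    X = np.fft.fft2(images, norm="ortho").reshape(Nt, -1).T   # (N1*N2, Nt)
    # Diagonal of D: average power spectrum of the exemplars, equation (21).
    D = np.mean(np.abs(X) ** 2, axis=1)                        # (N1*N2,)
    DinvX = X / D[:, None]                                     # D^{-1} X
    # H = D^{-1} X (X^H D^{-1} X)^{-1} d
    H = DinvX @ np.linalg.solve(X.conj().T @ DinvX, d)
    return H, X, D

rng = np.random.default_rng(0)
images = rng.standard_normal((5, 8, 8))   # hypothetical exemplars
d = np.ones(5)                            # desired responses at the origin
H, X, D = mace_filter(images, d)

# Constraint check in the spectral domain, equation (23).
spectral_response = X.conj().T @ H

# Same check in the space domain, equation (20); the two agree because the DFT is unitary.
h = np.fft.ifft2(H.reshape(8, 8), norm="ortho")
space_response = np.array([np.sum(img * h) for img in images])
```

Storing only the diagonal of $D$ avoids forming an $N_1 N_2 \times N_1 N_2$ matrix; the only matrix that is actually inverted is the $N_t \times N_t$ matrix $X^\dagger D^{-1} X$.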



3.3.1  Pre-processor/SDF  Decomposition 

As observed by Mahalanobis [1987], the MACE filter can be decomposed as a synthetic discriminant function preceded by a pre-whitening filter. Let the matrix $B = D^{-1/2}$, where $B$ is diagonal with diagonal elements equal to the inverse of the square root of the diagonal elements of $D$. We implicitly assume that the diagonal elements of $D$ are non-zero; consequently $B^\dagger B = D^{-1}$ and $B^\dagger = B$. Equation (24) can then be rewritten as

$$H = B(BX)\left((BX)^\dagger (BX)\right)^{-1}d. \tag{25}$$

Substituting $Y = BX$, representing the original exemplars preprocessed in the spectral domain by the matrix $B$, equation (25) can be written

$$H = BY\left(Y^\dagger Y\right)^{-1}d. \tag{26}$$

The term $H' = Y(Y^\dagger Y)^{-1}d$ is recognized as the SDF computed from the preprocessed exemplars $Y$. The MACE filter solution can therefore be written as a cascade of a pre-whitener (over the average power spectrum of the exemplars) followed by a synthetic discriminant function, depicted in figure 16, as

$$H = BH'. \tag{27}$$
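The decomposition can be verified numerically. In this sketch the spectral data matrix is a random complex stand-in (an assumption made for brevity, not real exemplar spectra), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
Nt, N = 4, 36                                   # hypothetical sizes
# Stand-in spectral data matrix (complex), one exemplar per column.
X = rng.standard_normal((N, Nt)) + 1j * rng.standard_normal((N, Nt))
d = np.ones(Nt)

D = np.mean(np.abs(X) ** 2, axis=1)             # diagonal of D
B = 1.0 / np.sqrt(D)                            # diagonal of B = D^{-1/2}

# Direct MACE solution, equation (24).
DinvX = X / D[:, None]
H_mace = DinvX @ np.linalg.solve(X.conj().T @ DinvX, d)

# Cascade: pre-whiten, then compute the SDF on the pre-whitened exemplars.
Y = B[:, None] * X                              # Y = B X
H_sdf = Y @ np.linalg.solve(Y.conj().T @ Y, d)  # H' = Y (Y^H Y)^{-1} d
H_cascade = B * H_sdf                           # H = B H', equation (27)
```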


1. If the DFT were as defined in [Oppenheim and Shafer, 1989] then a scale factor of $N_1 N_2$ would be necessary.


[Figure 16 diagram: pre-processor/SDF filter decomposition; the SDF block is $H' = Y(Y^\dagger Y)^{-1}d$.]


Figure  16.  Decomposition  of  MACE  filter  as  a  preprocessor  (i.e.  a  pre- 
whitening  filter  over  the  average  power  spectrum  of  the 
exemplars)  followed  by  a  synthetic  discriminant  function. 

3.4  Associative  Memory  Perspective 

Having presented the derivation of the MACE filter and the pre-processor/SDF decomposition, we now show that with a modification (addition of a linear pre-processor), the MACE filter is a special case of Kohonen's linear associative memory [1988].

Associative  memories  [Kohonen,  1988]  are  general  structures  by  which  pattern  vec- 
tors can  be  related  to  one  another,  typically  in  an  input/output  pair-wise  fashion.  An  input 
stimulus  vector  is  presented  to  the  associative  memory  structure  resulting  in  an  output 
response  vector.  The  input/output  pairs  establish  the  desired  response  to  a  given  input.  In 
the  case  of  an  auto-associative  memory,  the  desired  response  is  the  stimulus  vector, 
whereas,  in  a  hetero-associative  memory  the  desired  response  is  arbitrary.  From  a  signal 
processing  perspective,  associative  memories  are  viewed  as  projections  [Kung,  1992],  lin- 
ear and  nonlinear.  The  input  patterns  exist  in  a  vector  space  and  the  associative  memory 
projects them onto a new space. The linear associative memory of Kohonen [1988] is formulated exactly in this way.

A simple form of the linear hetero-associative memory maps vectors to scalars. It is formulated as follows. Given the set of input/output vector/scalar pairs $\{x_i \in \mathbb{R}^{N \times 1},\ d_i \in \mathbb{R};\ i = 1 \ldots N_t\}$, which are placed into an input data matrix, $x = [x_1 \ldots x_{N_t}]$, and desired output vector, $d = [d_1 \ldots d_{N_t}]^T$, find the vector, $h \in \mathbb{R}^{N \times 1}$, such that

$$x^\dagger h = d. \tag{28}$$

If the system of equations described by (28) is under-determined, the inner product

$$h^\dagger h \tag{29}$$

is minimized using (28) as a constraint. If the system of equations is over-determined,

$$\left(x^\dagger h - d\right)^\dagger \left(x^\dagger h - d\right)$$

is minimized.

Here, we are interested in the under-determined case. The optimal solution for the under-determined case, using the pseudo-inverse of $x$, is [Kohonen, 1988]

$$h = x\left(x^\dagger x\right)^{-1}d. \tag{30}$$
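A small numerical check of equation (30) is sketched below, using random real data as a stand-in for pattern vectors (the sizes and variable names are assumptions). It confirms that the closed form matches NumPy's pseudo-inverse and that any other vector satisfying the constraints has a larger norm.

```python
import numpy as np

rng = np.random.default_rng(2)
N, Nt = 16, 4                        # hypothetical: more unknowns than constraints
x = rng.standard_normal((N, Nt))     # input data matrix, one pattern per column
d = rng.standard_normal(Nt)          # desired scalar outputs

# Equation (30): h = x (x^T x)^{-1} d  (real data, so the dagger is a transpose).
h = x @ np.linalg.solve(x.T @ x, d)

# The same vector obtained from the pseudo-inverse of x^T.
h_pinv = np.linalg.pinv(x.T) @ d

# Any other solution differs from h by a vector in the null space of x^T.
z = rng.standard_normal(N)
null_part = z - x @ np.linalg.solve(x.T @ x, x.T @ z)   # project z onto null(x^T)
h_other = h + null_part
```

Because `null_part` is orthogonal to the columns of `x`, `h_other` still satisfies the constraints, but its squared norm exceeds that of `h` by the squared norm of `null_part`.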

As  was  shown  in  [Fisher  and  Principe,  1994],  we  can  modify  the  linear  associative 
memory  model  slightly  by  adding  a  pre-processing  linear  transformation  matrix,  A  ,  and 
find  h  such  that  the  under-determined  system  of  equations 

$$(Ax)^\dagger h = d \tag{31}$$

is satisfied while $h^\dagger h$ is minimized. As in the MACE filter, this optimization can be solved using the method of Lagrange multipliers. We adjoin the system of constraints to the optimization criterion as

$$J = h^\dagger h + \lambda^\dagger\left((Ax)^\dagger h - d\right), \tag{32}$$

where $\lambda \in \mathbb{R}^{N_t \times 1}$ is a column vector of Lagrange multipliers, one for each constraint (desired response). Taking the gradient of equation (32) with respect to the vector $h$ yields

$$\frac{\partial J}{\partial h} = 2h + Ax\lambda. \tag{33}$$

Setting  the  gradient  to  zero  and  solving  for  the  vector  h  yields 

$$h = -\frac{1}{2}Ax\lambda. \tag{34}$$

Substituting  this  result  into  the  constraint  equations  of  (31)  and  solving  for  the  Lagrange 
multipliers  yields 

$$\lambda = -2\left((Ax)^\dagger Ax\right)^{-1}d. \tag{35}$$

Substituting  this  result  back  into  equation  (34)  yields  the  final  solution  to  the  optimization 
as 

$$h = Ax\left(x^\dagger A^\dagger Ax\right)^{-1}d. \tag{36}$$

If  the  pre-processing  transformation,  A  ,  is  the  space-domain  equivalent  of  the  MACE 

filter's  spectral  pre-whitener  and  the  columns  of  the  data  matrix  x  contain  the  re-ordered 

elements  of  the  images  from  the  MACE  filter  problem  then  equation  (36)  combined  with 


the  pre-processing  transformation  yields  exactly  the  space  domain  coefficients  of  the 
MACE  filter.  This  can  be  shown  using  a  unitary  discrete  Fourier  transformation  (DFT) 
matrix. 

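If $U \in \mathbb{C}^{N_1 \times N_2}$ is the DFT of the image $u \in \mathbb{R}^{N_1 \times N_2}$, we can reorder both $U$ and $u$ into column vectors, $U \in \mathbb{C}^{N_1 N_2 \times 1}$ and $u \in \mathbb{C}^{N_1 N_2 \times 1}$, respectively. We can then implement the 2-D DFT as a unitary transformation matrix, $\Phi$, such that

$$U = \Phi u.$$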

In order for the transformation $A$ to be the space domain equivalent of the spectral pre-whitener of the MACE filter, the relationship

$$Ax = \Phi^\dagger Y = \Phi^\dagger BX,$$

where $B$ is the same matrix as in equation 27, must be true which, by inspection, means that

$$A = \Phi^\dagger B \Phi. \tag{37}$$

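Substituting equation (37) into equation (36) and using the property $B^\dagger B = BB = D^{-1}$ yields

$$\begin{aligned}
h &= Ax\left(x^\dagger A^\dagger Ax\right)^{-1}d \\
  &= \Phi^\dagger B\Phi x\left(x^\dagger(\Phi^\dagger B\Phi)^\dagger \Phi^\dagger B\Phi x\right)^{-1}d \\
  &= \Phi^\dagger B\Phi x\left(x^\dagger \Phi^\dagger B\Phi\Phi^\dagger B\Phi x\right)^{-1}d \\
  &= \Phi^\dagger BX\left(X^\dagger D^{-1}X\right)^{-1}d.
\end{aligned} \tag{38}$$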




Combining this solution for $h$ with the pre-processor in equation (31) for the equivalent linear system, $Ah$, yields

$$\begin{aligned}
Ah &= \Phi^\dagger B\Phi\Phi^\dagger BX\left(X^\dagger D^{-1}X\right)^{-1}d \\
   &= \Phi^\dagger D^{-1}X\left(X^\dagger D^{-1}X\right)^{-1}d.
\end{aligned}$$

Substituting the MACE filter solution, equation (24), gives the result

$$Ah = \Phi^\dagger H, \tag{39}$$

and so $Ah$ is the inverse DFT pair of the spectral domain MACE filter. This result establishes the relationship between the MACE filter and the linear associative memory. The decomposition of the MACE filter of figure 16 can also be considered as a cascade of a linear pre-processor followed by a linear associative memory (LAM) as in figure 17.
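The equivalence between the pre-processed linear associative memory and the spectral MACE filter can be checked with explicit matrices when the images are small. The sketch below is an illustration under stated assumptions: the 4 x 4 random images, the row-major reordering, and the helper `unitary_dft_matrix` are hypothetical choices, and the unitary 2-D DFT matrix is built as a Kronecker product of one-dimensional unitary DFT matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
N1, N2, Nt = 4, 4, 3                           # small enough for explicit matrices
images = rng.standard_normal((Nt, N1, N2))     # hypothetical exemplars
d = np.ones(Nt)

def unitary_dft_matrix(n):
    # DFT of the identity gives the (symmetric) unitary DFT matrix.
    return np.fft.fft(np.eye(n), norm="ortho")

# 2-D unitary DFT acting on row-major vectorized images.
Phi = np.kron(unitary_dft_matrix(N1), unitary_dft_matrix(N2))

x = images.reshape(Nt, -1).T                   # space-domain data matrix (N1*N2, Nt)
X = Phi @ x                                    # spectral-domain data matrix

# Spectral MACE filter, equation (24).
D = np.mean(np.abs(X) ** 2, axis=1)
DinvX = X / D[:, None]
H = DinvX @ np.linalg.solve(X.conj().T @ DinvX, d)

# Space-domain pre-whitener, equation (37), and LAM solution, equation (36).
B = np.diag(1.0 / np.sqrt(D))
A = Phi.conj().T @ B @ Phi
Ax = A @ x
h = Ax @ np.linalg.solve(Ax.conj().T @ Ax, d)

# Cascade of pre-processor and LAM compared with the inverse DFT of H.
h_equiv = A @ h
h_mace_space = Phi.conj().T @ H
```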




Figure  17.  Decomposition  of  MACE  filter  as  a  preprocessor  (i.e.  a  pre- 
whitening  filter  over  the  average  power  spectrum  of  the  exemplars) 
followed  by  a  linear  associative  memory. 


Since the two are equivalent, why make the distinction between the two perspectives? There are several reasons. The development of distortion invariant filtering and associative memories has proceeded in parallel. Distortion invariant filtering has been concerned with finding projections which will essentially detect a set of images. Towards this goal the techniques have emphasized analytic solutions resulting in linear discriminant functions. Advances have been concerned with better descriptions of the second order statistics of the causes of false detections. The approach, however, is still a data driven approach. The desired recognition class is represented through exemplars. In the distortion invariant filtering approach, the task has been confined to fitting a hyper-plane to the recognition exemplars subject to various quadratic optimization criteria.

The  development  of  associative  memories  has  proceeded  along  a  different  track.  It  is 
also data driven, but the emphasis has been on iterative machine learning methods. Many of the methods are biologically motivated, including the perceptron learning rule [Rosenblatt, 1958] and Hebbian learning [Hebb, 1949]. Other methods, including the least-mean-square (LMS) algorithm [Widrow and Hoff, 1960] (which we have described) and the backpropagation algorithm [Rumelhart et al., 1986; Werbos, 1974], are gradient descent based methods.

From the classification standpoint, of which the ATR problem is a subset, iterative methods have certain advantages. This can be illustrated with a simple example. Suppose the data matrix

$$x = [x_1 \ldots x_{N_t}]$$

were not full rank. In other words, the exemplars representing the recognition class could be represented without error in a subspace of dimension less than $N_t$. From an ATR perspective this would be a desirable property. The implicit assumption in any data driven method is that information about the recognition class is transmitted through exemplars. This is as true for distortion invariant filters, which have analytic solutions, as it is for iterative methods. The smaller the dimension of the subspace in which the recognition class lies, the better we can discriminate images considered to be out of the class. One limitation of the analytic solutions of distortion invariant filters is that they require the inverse of a matrix of the form

$$x^\dagger Q x, \tag{40}$$

where $Q$ is a positive definite matrix representing a quadratic optimization criterion. If the matrix, $x$, is not full column rank there is no inverse for the matrix of (40) and consequently no analytic solution for any of the distortion invariant filters. The LMS algorithm, however, will still find a best fit to the design goal, which is to minimize the criterion while satisfying the linear constraints.

We  can  illustrate  this  by  modifying  the  data  from  the  experiments  in  section  2.1.  It  is 
well  known  that  the  data  matrix  x  can  be  decomposed  using  the  singular  value  decompo- 
sition (SVD)  as 


$$x = U\Lambda V^T,$$

where the columns of $U \in \mathbb{R}^{N_1 N_2 \times N_t}$ form an ortho-normal basis (the principal components of the vector $x_i$, in fact), the diagonal matrix $\Lambda \in \mathbb{R}^{N_t \times N_t}$ contains the singular values of the data matrix, and $V \in \mathbb{R}^{N_t \times N_t}$ is unitary. The columns of the data matrix can be projected onto a subspace by setting one of the diagonal elements of $\Lambda$ to zero. The importance of any of the basis vectors in $U$ is directly proportional to the singular value. In this case $N_t = 21$ so we can choose one of the smaller singular values to set to zero without changing the basic structure of the data. For this example we choose the twelfth largest singular value. A data matrix $x_{\mathrm{sub}}$ is generated by

$$x_{\mathrm{sub}} = U \begin{bmatrix} \Lambda_{1-11} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \Lambda_{13-21} \end{bmatrix} V^T,$$

where $\Lambda_{i-j}$ is a diagonal matrix containing the $i$th through $j$th singular values of the original data matrix $x$.
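The rank reduction can be reproduced with a few lines of NumPy. The data here are random stand-ins with 21 columns (matching $N_t = 21$); the row dimension of 64 and the variable names are assumptions, not the ISAR data of the experiment.

```python
import numpy as np

rng = np.random.default_rng(4)
N, Nt = 64, 21                         # Nt = 21 as in the text; N is a stand-in dimension
x = rng.standard_normal((N, Nt))       # hypothetical full-rank data matrix

# Thin SVD: x = U diag(s) V^T, singular values sorted largest first.
U, s, Vt = np.linalg.svd(x, full_matrices=False)

# Zero the twelfth largest singular value (index 11) and rebuild the data matrix.
s_sub = s.copy()
s_sub[11] = 0.0
x_sub = U @ np.diag(s_sub) @ Vt

rank_full = np.linalg.matrix_rank(x)
rank_sub = np.linalg.matrix_rank(x_sub)

# Smallest eigenvalue of the Gram matrix that an analytic solution would need to invert.
gram_min_eig = np.linalg.eigvalsh(x_sub.T @ x_sub)[0]
```

The Gram matrix of the modified data has an eigenvalue that is zero to within rounding, which is why a closed-form solution requiring its inverse is unavailable.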

This data matrix is not full rank, so there is no analytical solution for the MACE filter; however, we can use the LMS approach and derive a linear associative memory. The columns of $x_{\mathrm{sub}}$ are pre-processed with a pre-whitening filter computed over the average power spectrum. The LMS algorithm can then be used to iteratively compute the transformation that best fits

$$(Ax_{\mathrm{sub}})^\dagger h = d$$

in a least squares sense; that is, we can find the $h$ that minimizes

$$\left((Ax_{\mathrm{sub}})^\dagger h - d\right)^\dagger \left((Ax_{\mathrm{sub}})^\dagger h - d\right),$$

where $d$ is a column vector of desired responses (set to all unity in this case).

The peak output response for this filter was computed over all of the aspect views of vehicle 1a and is shown in figure 18. The exemplars used to compute the filter are plotted with diamond symbols. The desired response cannot be met exactly so a least squares fit is achieved. Figure 19 shows the correlation output surface for one of the training exemplars.


[Plot: MACE filter (LMS); peak output response versus aspect angle.]

Figure 18. Peak output response over all aspects of vehicle 1a when the data matrix is not full rank. The LMS algorithm was used to compute the filter coefficients.

As  can  be  seen  in  the  image,  the  qualities  of  low  variance  and  localized  peak  are  still 
maintained  using  the  iterative  method. 

The learning curve, which measures the normalized mean square error (NMSE) between the filter output and the desired output, is shown as a function of the learning epoch (an epoch is one pass through the data) in figure 20. When the data matrix is full rank, as shown with a solid line, there is an exact solution and the error approaches zero. When $x_{\mathrm{sub}}$ is used the NMSE approaches a limit because there is no exact solution and so a least squares solution is found.
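The qualitative behavior of the two learning curves can be reproduced on synthetic data. The sketch below is a minimal illustration, not the experiment of the text: the exemplars are random vectors, the step size `mu` and epoch count are arbitrary choices, and `lms_train` is a hypothetical helper implementing the standard sample-by-sample LMS update.

```python
import numpy as np

def lms_train(y, d, mu=0.005, epochs=500):
    """Sample-by-sample LMS for the linear map h; columns of y are training vectors."""
    h = np.zeros(y.shape[0])
    nmse = np.empty(epochs)
    for epoch in range(epochs):
        for i in range(y.shape[1]):
            err = d[i] - y[:, i] @ h           # linear output error for one exemplar
            h += mu * err * y[:, i]            # LMS weight update
        resid = d - y.T @ h
        nmse[epoch] = (resid @ resid) / (d @ d)
    return h, nmse

rng = np.random.default_rng(5)
N, Nt = 64, 21
y_full = rng.standard_normal((N, Nt))          # hypothetical pre-processed exemplars
U, s, Vt = np.linalg.svd(y_full, full_matrices=False)
s[11] = 0.0                                    # rank-deficient copy of the data
y_sub = U @ np.diag(s) @ Vt
d = np.ones(Nt)                                # desired responses, all unity

h_full, nmse_full = lms_train(y_full, d)
h_sub, nmse_sub = lms_train(y_sub, d)

# Best achievable NMSE for the rank-deficient data (least squares reference).
h_ls = np.linalg.lstsq(y_sub.T, d, rcond=None)[0]
r_ls = d - y_sub.T @ h_ls
nmse_ls = (r_ls @ r_ls) / (d @ d)
```

With full rank data the NMSE keeps decreasing toward zero, while with the rank-deficient data it levels off near the least squares residual; a fixed step size leaves a small excess error above that floor.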



Figure  19.  Output  correlation  surface  for  LMS  computed  filter  from  non  full  rank 
data.  The  filter  output  is  not  substantially  different  from  the  analytic 
solution  with  full  rank  data. 

Since the system of constraint equations is generally under-determined, there are infinitely many filters which will satisfy the constraints. There is only one, however, that minimizes the norm of the filter (the optimization criterion after pre-processing) [Kohonen, 1988]. Figure 21 shows the NMSE between the analytic solution for the filter coefficients and the iterative¹ method. When the data matrix is full rank the iterative method approaches the optimal analytic solution, as shown by the solid line in the figure. When the data matrix is not full rank, as shown by the dashed line in the figure, the error in the iterative solution approaches a limit.

These  qualities  of  iterative  learning  methods  are  important  from  the  ATR  perspective. 
We  see  from  the  example  that  when  the  data  possesses  a  quality  that  would  seemingly  be 


1. In this case "iterative" refers to the LMS algorithm; within this text it generally refers to a gradient search algorithm.


[Plot: LMS learning curves; logarithmic vertical axis versus epoch.]

Figure 20. Learning curve for LMS approach. The learning curve for the LMS algorithm with the full rank data matrix is shown with a solid line; the non full rank case is shown with a dashed line.

useful  to  the  ATR  problem,  namely  that  the  class  can  be  described  by  a  sub-space,  the  ana- 
lytic solution  fails  when  the  number  of  exemplars  exceeds  the  dimensionality  of  the  sub- 
space.  The  iterative  method,  however,  finds  a  reasonable  solution.  Furthermore,  if  the  data 
matrix  is  full  rank,  the  iterative  method  approaches  the  optimal  analytic  solution. 


3.5  Comments 
There  are  further  motivations  for  the  associative  memory  perspective  and  by  extension 
the  use  of  iterative  methods.  It  is  well  known  that  non-linear  associative  memory  struc- 
tures can  outperform  their  linear  counterparts  on  the  basis  of  generalization  and  dynamic 
range [Kohonen, 1988; Hinton and Anderson, 1981]. In general, they are more difficult to
design  as  their  parameters  cannot  be  computed  analytically.  The  parameters  for  a  large 




Figure 21. NMSE between closed form solution and iterative solution. The learning curve for the LMS algorithm with the full rank data matrix is shown with a solid line; the non full rank case is shown with a dashed line.

class  of  nonlinear  associative  memories  can,  however,  be  determined  by  gradient  search 

techniques.  The  methods  of  distortion  invariant  filters  are  limited  to  linear  or  piece-wise 

linear  discriminant  functions.  It  is  unlikely  that  these  solutions  are  optimal  for  the  ATR 

problem. 

In  this  chapter  we  have  made  the  connection  between  distortion  invariant  filtering  and 
linear  associative  memories.  Furthermore  we  have  motivated  an  iterative  approach.  Recall 
figure 15, which shows the adaline architecture. In this architecture we can use the linear error term in order to train our system as a classifier. This is a consequence of the assumption that a linear discriminant function is desirable. If a linear discriminant function is suboptimal, which will almost always be the case for any high-dimensional classification problem, then we must work directly with the classification error.

We  have  also  shown  that  the  MSE  criterion  is  a  sufficient  proxy  for  classification  error 
(with  certain  restrictions),  however,  it  requires  that  we  work  with  the  true  output  error  of 
the  mapping  as  well  as  a  mapping  with  sufficient  flexibility  (i.e.  can  closely  approximate  a 
wide  range  of  functions  which  are  not  necessarily  linear).  The  linear  systems  approach, 
however,  does  not  allow  for  either  of  these  requirements.  Consequently,  we  must  adopt  a 
nonlinear  systems  approach  if  we  hope  to  achieve  improved  performance.  The  next  chap- 
ter will  show  that  the  MACE  filter  can  be  extended  to  nonlinear  systems  such  that  the 
desirable  properties  of  shift  invariance  and  localized  detection  peak  are  maintained  while 
achieving  superior  classification  performance. 


CHAPTER  4 

STOCHASTIC  APPROACH  TO  TRAINING  NONLINEAR 
SYNTHETIC  DISCRIMINANT  FUNCTIONS 

4.1 Nonlinear Iterative Approach

The  MACE  filter  is  the  best  linear  system  that  minimizes  the  energy  in  the  output  cor- 
relation plane  subject  to  a  peak  constraint  at  the  origin.  An  advantage  of  linear  systems  is 
that  we  have  the  mathematical  tools  to  use  them  in  optimal  operating  conditions  from  the 
standpoint  of  second  order  statistics.  Such  optimality  conditions,  however,  should  not  be 
confused  with  the  best  possible  classification  performance. 

Our  goal  is  to  extend  the  optimality  condition  of  MACE  filters  to  adaptive  nonlinear 
systems  and  classification  performance.  The  optimality  condition  of  the  MACE  filter  con- 
siders the  entire  output  plane,  not  just  the  response  when  the  image  is  centered.  With 
regards  to  general  nonlinear  filter  architectures  which  can  be  trained  iteratively,  a  brute 
force approach would be to train a neural network with a desired output of unity for the centered images and zero for all shifted images. This would indeed emulate the optimality of the MACE filter; however, the result is a training algorithm of order $N_1 N_2 N_t$ for $N_t$ training images of size $N_1 \times N_2$ pixels. This is clearly impractical.
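To give a sense of scale, consider a worked count under assumed sizes: images of $64 \times 64$ pixels (the size used for the ISAR images later in this chapter) and $N_t = 21$ exemplars (the number used in the earlier rank example). Treating every shift of every exemplar as a separate training pattern gives

$$N_1 N_2 N_t = 64 \cdot 64 \cdot 21 = 86{,}016$$

patterns per epoch, of which only 21 carry a desired output of unity.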

In this section we propose a nonlinear architecture for extending the MACE filter. We discuss some of its properties. Appropriate measures of generalization are discussed. We also present a statistical viewpoint of distortion invariant filters from which such nonlinear extensions fit naturally into an iterative framework. From this iterative framework we present experimental results which exhibit improved discrimination and generalization performance with respect to the MACE filter while maintaining the properties of localized detection peak and low variance in the output plane.

4.2 A Proposed Nonlinear Architecture
As we have stated, the MACE filter can be decomposed as a pre-whitening filter followed by a synthetic discriminant function (SDF), which can also be viewed as a special case of Kohonen's linear associative memory (LAM) [Hester and Casasent, 1980; Fisher and Principe, 1994]. This decomposition is shown at the top of figure 22. The nonlinear filter architecture we are proposing is shown in the middle of figure 22. In this architecture we replace the LAM with a nonlinear associative memory, specifically a feed-forward multi-layer perceptron (MLP), shown in more detail at the bottom of figure 22. We will refer to this structure as the nonlinear MACE filter (NL-MACE) for brevity.

Another  reason  for  choosing  the  multi-layer  perceptron  (MLP)  is  that  it  is  capable  of 
achieving  a  much  wider  range  of  discriminant  functions.  It  is  well  known  that  an  MLP 
with  a  single  hidden  layer  can  approximate  any  discriminant  function  to  any  arbitrary 
degree  of  precision  [Funahashi,  1989].  One  of  the  shortcomings  of  distortion  invariant 
approaches  such  as  the  MACE  filter  is  that  it  attempts  to  fit  a  hyper-plane  to  our  training 
exemplars  as  the  discriminant  function.  Using  an  MLP  in  place  of  the  LAM  relaxes  this 
constraint. MLPs do not, in general, allow for analytic solutions. We can, however, determine their parameters iteratively using gradient search.


[Figure 22 diagram: input image $x \in \mathbb{R}^{N_1 \times N_2}$; pre-processor $y = Ax$; SDF/LAM $h = y(y^\dagger y)^{-1}d$ (linear filter); pre-processed image passed to the MLP.]


Figure 22. Decomposition of optimized correlator as a pre-processor followed by SDF/LAM (top). Nonlinear variation shown with MLP replacing SDF in signal flow (middle), detail of the MLP (bottom). The linear transformation $A$ represents the space domain equivalent of the spectral pre-processor $(\alpha P_n + (1-\alpha)P_x)^{-1/2}$.


4.2.1 Shift Invariance of the Proposed Nonlinear Architecture
One of the properties of the MACE filter is shift invariance. We wish to maintain that property in our nonlinear extensions. A transformation, $T[\,]$, of a two-dimensional function is shift invariant if it can be shown that

$$g(n_1, n_2) = T[y(n_1, n_2)]$$

implies

$$g(n_1 + n_1', n_2 + n_2') = T[y(n_1 + n_1', n_2 + n_2')],$$

where $n_1, n_1', n_2, n_2'$ are integers. In other words, a shift of the input signal is reflected as a corresponding shift of the output signal [Oppenheim and Shafer, 1989]. We show here that this property is maintained for our proposed nonlinear architecture.

The pre-processor of the nonlinear architecture at the bottom of figure 22 is the same as the pre-processor of the linear filter shown at the top. The pre-processor is implemented as a linear shift invariant (LSI) filter. Cascading shift invariant operations maintains shift invariance of the entire system [Oppenheim and Shafer, 1989]. In order to show that the system as a whole is shift invariant, it is sufficient to show that the MLP is shift invariant. The mapping function of the MLP in figure 22 can be written


$$g(\omega, y) = \sigma\left(W_3\,\sigma\left(W_2\,\sigma(W_1 y) + \varphi\right)\right), \quad \omega = \left\{W_1 \in \mathbb{R}^{M_1 \times N_1 N_2},\ W_2 \in \mathbb{R}^{M_2 \times M_1},\ W_3 \in \mathbb{R}^{1 \times M_2},\ \varphi \in \mathbb{R}^{M_2 \times 1}\right\}. \tag{41}$$

In the nonlinear architecture, the matrix $W_i$ represents the connectivities from the processing elements (PEs) of layer $(i-1)$ to the input to the PEs of layer $i$; that is, the matrix $W_i$ is applied as a linear transformation to the vector output of layer $(i-1)$. When $i = 1$ the transformation is applied to the input vector, $y$. The number of PEs in layer $i$ is denoted by $M_i$. In equation 41, $\varphi$ is a constant bias vector added to each element of the vector $W_2\sigma(W_1 y) \in \mathbb{R}^{M_2 \times 1}$. It is also assumed that if the argument to the nonlinear function $\sigma(\,)$ is a matrix or vector then the nonlinearity is applied to each element of the matrix or vector.

The input to the MLP is denoted as a vector, $y \in \mathbb{R}^{N_1 N_2 \times 1}$. The elements of the vector are samples of a two-dimensional pre-whitened input signal, $y(n_1, n_2)$. We can write the $i$th element of the vector as a function of the two-dimensional signal as follows

$$y_i(n_1, n_2) = y\left(n_1 + \langle i, N_1 \rangle,\ n_2 + \lfloor i, N_1 \rfloor\right), \quad i = 0, \ldots, N_1 N_2 - 1,$$

where $\langle i, N_1 \rangle$ indicates a modulo operation (the remainder of $i$ divided by $N_1$) and $\lfloor i, N_1 \rfloor$ indicates integer division of $i$ by $N_1$. Written this way, the elements of the vector $y$ sample a rectangular region of support of size $N_1 \times N_2$ beginning at sample $(n_1, n_2)$ in the pre-whitened signal, $y(n_1, n_2)$. The vector argument of equation 41 and the resulting output signal can now be written as an explicit function of the beginning sample point of the template within the pre-whitened image

$$g_\omega(n_1, n_2) = g(\omega, y(n_1, n_2)) = \sigma\left(W_3\,\sigma\left(W_2\,\sigma(W_1 y(n_1, n_2)) + \varphi\right)\right). \tag{42}$$

The output of the mapping as written in equation 42 is now an explicit function of $(n_1, n_2)$ and the constant parameter set, $\omega$ (which does not vary with $(n_1, n_2)$). We can also write the output response as a function of the shifted version of the image, $y(n_1, n_2)$, as

$$g_\omega(n_1 + n_1', n_2 + n_2') = g(\omega, y(n_1 + n_1', n_2 + n_2')). \tag{43}$$

Since the parameters, $\omega$, are constant, equations 42 and 43 are sufficient to show the mapping of the MLP is shift invariant and consequently, the system as a whole (including the shift invariant pre-processor) is also shift invariant.
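The shift invariance argument can be illustrated numerically. The sketch below makes several assumptions that are not in the text: random weights, `tanh` as the nonlinearity, small layer sizes, and circular wrap-around at the image borders so that the comparison is exact. The function `mlp_output_plane` is a hypothetical helper that evaluates the mapping of equation 42 with the same constant parameters at every template position.

```python
import numpy as np

def mlp_output_plane(y_img, W1, W2, W3, phi, N1, N2):
    """Evaluate sigma(W3 sigma(W2 sigma(W1 y) + phi)) at every template position."""
    S1, S2 = y_img.shape
    out = np.empty((S1, S2))
    a = np.arange(N1)[:, None]
    b = np.arange(N2)[None, :]
    for n1 in range(S1):
        for n2 in range(S2):
            # Element i of the vector is y(n1 + i mod N1, n2 + i // N1), i.e. a
            # column-major ordering of the N1 x N2 window (indices wrap circularly).
            window = y_img[(n1 + a) % S1, (n2 + b) % S2]
            y_vec = window.ravel(order="F")
            hidden1 = np.tanh(W1 @ y_vec)
            hidden2 = np.tanh(W2 @ hidden1 + phi)
            out[n1, n2] = np.tanh(W3 @ hidden2)[0]
    return out

rng = np.random.default_rng(6)
N1, N2, M1, M2 = 4, 4, 3, 2                      # hypothetical template and layer sizes
W1 = rng.standard_normal((M1, N1 * N2))
W2 = rng.standard_normal((M2, M1))
W3 = rng.standard_normal((1, M2))
phi = rng.standard_normal(M2)

y_img = rng.standard_normal((12, 12))            # stand-in pre-whitened image
g = mlp_output_plane(y_img, W1, W2, W3, phi, N1, N2)

# Shift the input so that y_shifted(n1, n2) = y(n1 + s1, n2 + s2).
s1, s2 = 3, 5
y_shifted = np.roll(y_img, shift=(-s1, -s2), axis=(0, 1))
g_shifted = mlp_output_plane(y_shifted, W1, W2, W3, phi, N1, N2)
g_expected = np.roll(g, shift=(-s1, -s2), axis=(0, 1))
```

Because the weights do not depend on the template position, the output plane of the shifted input is the shifted output plane of the original input.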

4.3 Classifier Performance and Measures of Generalization
One  of  the  issues  for  any  iterative  method  which  relies  on  exemplars  is  the  number  of 
training  exemplars  to  use  in  the  computation  of  the  discriminant  function.  In  addition,  for 
iterative  methods,  there  is  the  issue  of  when  to  stop  the  adaptation  process.  In  the  case  of 
distortion  invariant  filters,  such  as  the  MACE  filter,  some  common  heuristics  are  used  to 
determine  the  number  of  training  exemplars.  Typically  samples  are  drawn  from  the  train- 
ing set  and  used  to  compute  the  filter  from  equation  23  until  the  minimum  peak  response 
over  the  remaining  samples  exceeds  some  threshold  [Casasent  and  Ravichandran,  1992]. 
A  similar  heuristic  is  to  continue  to  draw  samples  from  the  training  set  until  the  mean 
square  error  of  the  peak  response  over  the  remaining  samples  drops  below  some  preset 
threshold. These measures are then used as indicators of how well the filter generalizes to between-aspect exemplars from the training set which have not been used for the computation of the filter coefficients.
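One possible reading of the first heuristic is sketched below. Several details are assumptions made for illustration and are not specified in the text: exemplars are added greedily (the unused exemplar with the weakest response is added next), the response at the origin of the correlation plane stands in for the peak response, the desired outputs are all unity, and the images are synthetic. The function names are hypothetical.

```python
import numpy as np

def mace_from_columns(X, d):
    """MACE filter of equation (24) from a spectral data matrix X (one exemplar per column)."""
    D = np.mean(np.abs(X) ** 2, axis=1)
    DinvX = X / D[:, None]
    return DinvX @ np.linalg.solve(X.conj().T @ DinvX, d)

def select_exemplars(X_all, threshold):
    """Greedy sketch: add exemplars until every unused one responds above threshold."""
    n_total = X_all.shape[1]
    chosen = [0]                                   # assumed starting exemplar
    while True:
        H = mace_from_columns(X_all[:, chosen], np.ones(len(chosen)))
        remaining = [j for j in range(n_total) if j not in chosen]
        if not remaining:
            return chosen, H
        # Response at the origin of the correlation plane for each unused exemplar.
        resp = np.abs(X_all[:, remaining].conj().T @ H)
        worst = int(np.argmin(resp))
        if resp[worst] >= threshold:
            return chosen, H
        chosen.append(remaining[worst])

# Synthetic "aspect" sequence: a smooth blend of two images plus a small perturbation.
rng = np.random.default_rng(7)
img_a = rng.standard_normal((8, 8))
img_b = rng.standard_normal((8, 8))
t = np.linspace(0.0, 1.0, 9)
images = np.array([(1 - tk) * img_a + tk * img_b + 0.1 * rng.standard_normal((8, 8))
                   for tk in t])
X_all = np.fft.fft2(images, norm="ortho").reshape(len(t), -1).T

chosen, H = select_exemplars(X_all, threshold=0.7)
print("exemplars used:", sorted(chosen))
```

As the surrounding discussion points out, a filter that passes this stopping rule is not guaranteed to classify well; the rule only measures how closely the unused training images reproduce the desired response.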

The ultimate goal, however, is classification. Generalization in the context of classification
must be related to the ability to classify a previously unseen input [Bishop, 1995].
We show by example that the measures of generalization mentioned above may be misleading
as predictors of classifier performance, even for the linear filters. In fact, the results
of the experiments will show that the way in which the data is pre-processed is more indicative
of classifier performance than these other indirect measures.



We illustrate this point with an example using ISAR image data. A data set larger than
in the previous experiments will be used. Two more vehicles, one from each vehicle type,
will be used for the testing set, and all vehicles will be sampled at higher aspect resolution.
Figure 23 shows ISAR images of size 64 x 64 taken from five different vehicles and two
different vehicle types. The images are all taken with the same radar. Data taken from
vehicles in the same class vary in the vehicle configuration and radar depression angle (15
or 20 degrees depression). Images have been formed from each vehicle at aspect variations
of 0.125 degrees from 5 to 85 degrees aspect for a total of 641 images for each vehicle.
Figure 23 shows each of the vehicles at 5, 45, and 85 degrees aspect.

We will use vehicle type 1 as the recognition class and vehicle type 2 as a confusion
vehicle. Images of vehicle 1a will be used as the set from which to draw training exemplars.
Classification performance will then be measured as the ability to recognize vehicles
1b and 1c while rejecting vehicles 2a and 2b. The filter we will use is a form of the
OTSDF [Refregier and Figue, 1991] which is computed in the spectral domain as


$$H = [\alpha P_x + (1-\alpha) P_n]^{-1} X \left[ X^{\dagger} [\alpha P_x + (1-\alpha) P_n]^{-1} X \right]^{-1} d, \qquad (44)$$

where the columns of the data matrix $X \in \mathbb{C}^{N_1 N_2 \times N_t}$ are the Fourier coefficients of $N_t$
exemplar images of dimension $N_1 \times N_2$ of vehicle 1a reordered into column vectors. The
diagonal matrix $P_x \in \mathbb{R}^{N_1 N_2 \times N_1 N_2}$ contains the coefficients of the average power spectrum
measured over the $N_t$ exemplars of vehicle 1a, while $P_n \in \mathbb{R}^{N_1 N_2 \times N_1 N_2}$ is the identity
matrix scaled by the average of the diagonal terms of $P_x$. Finally, $d \in \mathbb{R}^{N_t \times 1}$ is a
column vector of desired outputs, one for each exemplar. The elements of $d$ are typically
set to unity. When $\alpha$ is set to unity equation 44 yields exactly the MACE filter; when it is
set to zero the result is the SDF. The filter we are using therefore trades off the MACE
filter criterion against the SDF criterion. The SDF criterion can also be viewed as the
MVSDF [Kumar, 1986] criterion when the noise class is represented by a white noise random
process. This filter can also be decomposed as in figure 22.

Figure 23. ISAR images of two vehicle types shown at aspect angles of 5, 45, and
85 degrees respectively. Three different vehicles of type 1 (a, b, and c)
are shown, while two different vehicles of type 2 (a and b) are shown.
Vehicle 1a is used as a training vehicle, while vehicles 1b and 1c are
used as the testing vehicles for the recognition class. Vehicles 2a and
2b are used as confusion vehicles.
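As an illustration of equation 44, the following is a minimal NumPy sketch, not the implementation used for the experiments. The function name `otsdf_filter`, the array layout, and the use of an unnormalized 2-D DFT are assumptions made here; because the matrices being inverted are diagonal, they are stored as vectors. The sketch also assumes the averaged power spectrum has no zero entries when $\alpha$ equals unity.

```python
import numpy as np

def otsdf_filter(images, alpha, d=None):
    """Spectral-domain filter of equation 44 from a stack of exemplar images.

    images: real array of shape (N_t, N1, N2); alpha in [0, 1].
    Returns the frequency-domain coefficients H as an (N1, N2) array.
    """
    n_t = images.shape[0]
    # Columns of X are the 2-D DFTs of the exemplars, reordered into vectors.
    X = np.fft.fft2(images).reshape(n_t, -1).T            # (N1*N2, N_t)
    # Diagonal of P_x: average power spectrum over the exemplars.
    p_x = np.mean(np.abs(X) ** 2, axis=1)
    # Diagonal of P_n: identity scaled by the mean of the diagonal of P_x.
    p_n = np.full_like(p_x, p_x.mean())
    p = alpha * p_x + (1.0 - alpha) * p_n                 # diagonal of the pre-processor
    if d is None:
        d = np.ones(n_t)                                  # desired outputs set to unity
    p_inv_x = X / p[:, None]                              # P^{-1} X for diagonal P
    a = X.conj().T @ p_inv_x                              # X^dagger P^{-1} X, (N_t, N_t)
    h = p_inv_x @ np.linalg.solve(a, d)
    return h.reshape(images.shape[1:])
```

With this layout the constraints can be checked directly: `X.conj().T @ H.ravel()` should reproduce `d` for the exemplars used to build the filter, for any `alpha` in the stated range.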


These experiments examine the relationship between the two commonly used measures
of generalization and two measures of classification performance. We can draw conclusions
from the results about the appropriateness of the generalization measures with
regard to classification. The first generalization measure is the minimum peak response,
denoted $y_{min}$, taken over the aspect range of the images of the training vehicle (excluding
the aspects used for computing the filter). The second generalization measure is the mean
square error, denoted $y_{mse}$, between the desired output of unity and the peak response over
the aspect range of the images of the training vehicle (excluding the aspects used for computing
the filter). The classification measures are taken from the receiver operating characteristic
(ROC) curve measuring the probability of detecting, $P_d$, a testing vehicle in the
recognition class (vehicles 1b and 1c) versus the probability of false alarm, $P_{fa}$, on a testing
vehicle in the confusion class (vehicles 2a and 2b) based on peak detection. The specific
measures are the area under the ROC curve, a general measure of the test being used,
and the probability of false alarm when the probability of detection equals 80%, which
measures a single point of interest on the ROC curve.
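The four measures just described can be computed from lists of peak responses. The sketch below is one plausible way to do so; the function names, the empirical threshold sweep, and the trapezoid-rule area are assumptions made for illustration and are not taken from the text.

```python
import numpy as np

def generalization_measures(peaks_unused):
    """y_min and y_mse over training-vehicle images not used to build the filter."""
    peaks = np.asarray(peaks_unused, dtype=float)
    y_min = float(peaks.min())
    y_mse = float(np.mean((1.0 - peaks) ** 2))    # desired output is unity
    return y_min, y_mse

def roc_measures(peaks_recognition, peaks_confusion, pd_target=0.8):
    """Empirical ROC area and P_fa at the first threshold where P_d >= pd_target."""
    rec = np.asarray(peaks_recognition, dtype=float)
    con = np.asarray(peaks_confusion, dtype=float)
    # Sweep thresholds from high to low; a detection is peak >= threshold.
    thresholds = np.unique(np.concatenate([rec, con]))[::-1]
    pd = np.array([(rec >= t).mean() for t in thresholds])
    pfa = np.array([(con >= t).mean() for t in thresholds])
    # Prepend the (0, 0) operating point, then integrate with the trapezoid rule.
    pd = np.concatenate([[0.0], pd])
    pfa = np.concatenate([[0.0], pfa])
    area = float(np.sum(np.diff(pfa) * (pd[1:] + pd[:-1]) / 2.0))
    pfa_at_pd = float(pfa[np.argmax(pd >= pd_target)])
    return area, pfa_at_pd
```

Here `peaks_recognition` would hold the peak responses for the testing vehicles in the recognition class and `peaks_confusion` those for the confusion vehicles.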

Two filters are used, one with $\alpha = 0.5$ and the other with $\alpha = 0.95$, that is, one in which
both criteria are weighted equally and one which is close to the MACE filter criterion.
The number of exemplars drawn from the training vehicle (1a) is varied from 21 to 81,
sampled uniformly in aspect (1 to 4 degrees aspect separation between exemplars).
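The exemplar counts quoted above follow from the 0.125 degree image spacing over the 5 to 85 degree range. A short sketch of the arithmetic, assuming (this is not stated in the text) that the first exemplar is taken at 5 degrees; `training_indices` is a hypothetical helper name:

```python
import numpy as np

aspects = np.arange(641) * 0.125 + 5.0       # 5 to 85 degrees in 0.125 degree steps

def training_indices(separation_deg, step_deg=0.125, n_images=641):
    """Indices of exemplars sampled uniformly at the given aspect separation."""
    stride = int(round(separation_deg / step_deg))
    return np.arange(0, n_images, stride)

for sep in (1.0, 2.0, 4.0):
    idx = training_indices(sep)
    # Number of exemplars and the first and last aspect angles covered.
    print(sep, len(idx), aspects[idx[0]], aspects[idx[-1]])
```

Separations of 1, 2, and 4 degrees give 81, 41, and 21 exemplars respectively, each set spanning the full 5 to 85 degree range.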

Examination of figures 24 and 25 shows that for both cases ($\alpha$ equal to 0.5 and 0.95)
no clear relationship emerges in which the generalization measures are indicators of good
classification performance. Table 1 compares the classifier performance when the generalization
measures as described above are used to choose the filter versus the best ROC performance
achieved throughout the range of aspect separation. In one regard, the
generalization measures were consistent in that the same aspect separation was predicted
by both measures for both settings of $\alpha$. In figure 26 we compare the ROC curves for two
cases, first where the filter is chosen using the generalization measures and second the best
achieved ROC curve, for both settings of $\alpha$. We would expect that for each $\alpha$ the filter
chosen using the generalization measure would be near the best ROC performance. As can be
seen in the figure, this is not the case.

Table 1. Classifier performance measures when the filter is determined by either of the
common measures of generalization as compared to best classifier performance for two
values of $\alpha$.

                                           Generalization Measure
                                           y_min        y_mse        Best
$\alpha = 0.50$    P_fa @ P_d = 0.8        0.24         0.24         0.16
                   ROC area                0.83         0.83         0.90
$\alpha = 0.95$    P_fa @ P_d = 0.8        0.16         0.16         0.07
                   ROC area                0.94         0.94         0.95

It is apparent from figures 24 and 25 that the generalization measures are not significantly
correlated with the ROC performance. In fact, as summarized in table 2, the generalization
measures are weakly correlated with ROC performance in the direction opposite
to that expected of a useful predictor. One feature of figures 24 and 25 is that although ROC
performance varies independently of

[Figure 24 plots: $y_{min}$ vs. ROC area and $y_{min}$ vs. $P_{fa}$ (@ $P_d = 0.8$), with separate markers for $\alpha = 0.50$ and $\alpha = 0.95$.]

Figure 24. Generalization as measured by the minimum peak response. The plot
compares $y_{min}$ versus classification performance measures (ROC area
and $P_{fa}$ @ $P_d = 0.8$).


[Figure 25 plots: $y_{mse}$ vs. ROC area and $y_{mse}$ vs. $P_{fa}$ (@ $P_d = 0.8$), with separate markers for $\alpha = 0.50$ and $\alpha = 0.95$.]

Figure 25. Generalization as measured by the peak response mean square error.
The plot compares $y_{mse}$ versus classification performance measures
(ROC area and $P_{fa}$ @ $P_d = 0.8$).


[Figure 26 plot: four ROC curves, labeled 0.50 best generalization, 0.50 best ROC, 0.95 best generalization, and 0.95 best ROC.]

Figure 26. Comparison of ROC curves. The ROC curves for the number of
training exemplars yielding the best generalization measure versus the
number yielding the best ROC performance for values of $\alpha$ equal to
0.5 and 0.95 are shown.

either the minimum peak response or the MSE, there does appear to be a dependency on $\alpha$.
This leads to a second experiment.

Table 2. Correlation of generalization measures to classifier performance. In both cases ($\alpha$ equal to 0.5
or 0.95) the classifier performance, as measured by the area of the ROC curve or $P_{fa}$ at $P_d$ equal to 0.8, has
the opposite correlation to what would be expected of a useful measure for predicting performance.

                                          Performance Measures
                           $\alpha = 0.50$                        $\alpha = 0.95$
Generalization Measure     ROC area    P_fa (@ P_d = 0.8)     ROC area    P_fa (@ P_d = 0.8)
y_min                       -0.39        0.21                  -0.40        0.41
y_mse                        0.32       -0.11                   0.31       -0.35


In the second experiment we examine the relationship between the parameter $\alpha$ and
the ROC performance. The aspect separation between training exemplars is set to 2, 4, and
8 degrees. The value of $\alpha$, the emphasis on the MACE criterion, is varied in the range
zero to unity. Figure 27 shows the relationship between ROC performance and the value
of $\alpha$. It is clear from the plots that there is a positive relationship between the emphasis on
the MACE criterion and the ROC performance. However, the peak in ROC performance is
not achieved at $\alpha$ equal to unity. In all three cases, the ROC performance peaks just prior
to unity, and the drop-off in performance at $\alpha$ equal to unity increases with aspect
separation.

The difference between the SDF and MACE filter is the pre-processor. What is shown
by this analysis is that, in general, the pre-processor from the MACE filter criterion leads
to better classification, but too much emphasis on the MACE filter criterion, as measured
by $\alpha$ equal to unity, leads to a filter which is too specific to the training samples. The
problems described above are well known. Alterations to the MACE criterion have been
studied by many researchers [Casasent et al., 1991; Casasent and Ravichandran, 1992;
Ravichandran and Casasent, 1992; Mahalanobis et al., 1994a]. There is, as yet, no
principled method found in the literature by which to set the parameter $\alpha$.

There are two conclusions from this analysis that are pertinent to the nonlinear extension
we are using. First, the results show that pre-whitening over the recognition class
leads to better classification performance. For this reason we choose to use the pre-processor
of the MACE filter in our nonlinear filter architecture. The issue of extending the
MACE filter to nonlinear systems can in this way be formulated as a search for a more
robust nonlinear discriminant function in the pre-whitened image space.

The second conclusion is that comparisons of the nonlinear filter to its linear counterpart
must be made in terms of classification performance only. There are simple nonlinear
systems, such as a soft threshold at the output of a linear system, that will outperform
the MACE filter or its variations in terms of maximizing the minimum peak
response over the training vehicle or reducing the variance in the output image plane.
These measures are not, however, sufficient to describe classification performance. We
have also used these measures in the past but feel that they are not the most appropriate for
classification [Fisher and Principe, 1995b].

[Figure 27 plots: ROC area vs. $\alpha$ and $P_{fa}$ (@ $P_d = 0.8$) vs. $\alpha$, with separate markers for training aspect separations of 2.00, 4.00, and 8.00 degrees.]

Figure 27. ROC performance measures versus $\alpha$. Results are shown for training
aspect separations of 2, 4, and 8 degrees. These plots indicate that
ROC performance is positively related to $\alpha$.

4.4 Statistical Characterization of the Rejection Class

We now present a statistical viewpoint of distortion invariant filters from which such
nonlinear extensions fit naturally into an iterative framework. This treatment results in an
efficient way to capture the optimality condition of the MACE filter using a training algorithm
which is approximately of order $N_t$, and which leads to better classification performance
than the linear MACE.

A possible approach to designing a nonlinear extension to the MACE filter and improving
on the generalization properties is to simply substitute the linear processing elements of
the LAM with nonlinear elements. Since such a system can be trained with error backpropagation
[Rumelhart et al., 1986], the issue would be simply to report on performance
comparisons with the MACE. Such a methodology does not, however, lead to an understanding
of the role of the nonlinearity, and does not elucidate the trade-offs in the design and in
training.

Here we approach the problem from a different perspective. We seek to extend the
optimality condition of the MACE to a nonlinear system, i.e. the energy in the output
space is minimized while maintaining the peak constraint at the origin. Hence we will
impose these constraints directly in the formulation, even knowing a priori that an analytical
solution is very difficult or impossible to obtain. We reformulate the MACE filter from
a statistical viewpoint and generalize it to arbitrary mapping functions, linear and nonlinear.

Consider images of dimension $N_1 \times N_2$ re-ordered by column or row into vectors. Let
the rejection class be characterized by the random vector $X_1 \in \mathbb{R}^{N_1 N_2 \times 1}$. We know the
second-order statistics of this class as represented by the average power spectrum (or
equivalently the autocorrelation function). Let the recognition class be characterized by
the columns of a data matrix $x_2 \in \mathbb{R}^{N_1 N_2 \times N_t}$, which are observations of the random vector
$X_2 \in \mathbb{R}^{N_1 N_2 \times 1}$, similarly re-ordered. We wish to find the parameters, $\omega$, of a mapping
$g(\omega, X): \mathbb{R}^{N_1 N_2 \times 1} \rightarrow \mathbb{R}$ such that we may discriminate the recognition class from the
rejection class. Here, it is the mapping function, $g$, which defines the discriminator topology.

Towards this goal, we wish to minimize the objective function

$$J = E(g(\omega, X_1)^2)$$

over the mapping parameters, $\omega$, subject to the system of constraints

$$g(\omega, x_2) = d^T, \qquad (45)$$

where $d \in \mathbb{R}^{N_t \times 1}$ is a column vector of desired outputs. It is assumed that the mapping
function is applied to each column of $x_2$, and $E(\cdot)$ is the expected value function.


Using the method of Lagrange multipliers, we can augment the objective function as

$$J = E(g(\omega, X_1)^2) + (g(\omega, x_2) - d^T)\lambda, \qquad (46)$$

where $\lambda \in \mathbb{R}^{N_t \times 1}$ is a vector whose elements are the Lagrange multipliers, one for each
constraint. Computing the gradient with respect to the mapping parameters yields

$$\frac{\partial J}{\partial \omega} = 2E\left(g(\omega, X_1)\frac{\partial g(\omega, X_1)}{\partial \omega}\right) + \frac{\partial g(\omega, x_2)}{\partial \omega}\lambda. \qquad (47)$$

Equation 47 along with the constraints of equation 45 can be used to solve for the optimal
parameters, $\omega$, assuming our constraints form a consistent set of equations. This is,
of course, dependent on the mapping topology.

4.4.1 The Linear Solution as a Special Case

It is interesting to verify that this formulation yields the MACE filter as a special case.
If, for example, we choose the mapping to be a linear projection of the input image, that is

$$g(\omega, X) = \omega^T X,$$

equation 46 becomes, after simplification,

$$J = \omega^T E(X_1 X_1^T)\,\omega + (\omega^T x_2 - d^T)\lambda. \qquad (48)$$

In order to solve for the mapping parameters, $\omega$, we are still left with the task of computing
the term $E(X_1 X_1^T)$ which, in general, we can only estimate from observations of the
random vector, $X_1$, or assume a specific form. Assuming that we have a suitable estimator,
the well known solution to the minimum of equation 48 over the mapping parameters
subject to the constraints of equation 45 is

$$\omega = \hat{R}_{x_1}^{-1} x_2 \left(x_2^T \hat{R}_{x_1}^{-1} x_2\right)^{-1} d, \qquad (49)$$

where

$$\hat{R}_{x_1} = \text{estimate}\{E(X_1 X_1^T)\}. \qquad (50)$$
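For completeness, the step from equation 48 to equation 49 can be written out as below. This is a standard constrained least-squares argument rather than a derivation quoted from the text; $R$ is shorthand introduced here for the estimate of $E(X_1 X_1^T)$, which is assumed symmetric and invertible, and $x_2$ is assumed to have full column rank.

```latex
% R denotes the estimate of E(X_1 X_1^T); assumed symmetric and invertible.
\begin{align*}
J &= \omega^T R\,\omega + (\omega^T x_2 - d^T)\lambda \\
\frac{\partial J}{\partial \omega} &= 2R\,\omega + x_2\lambda = 0
  \quad\Rightarrow\quad \omega = -\tfrac{1}{2} R^{-1} x_2 \lambda \\
x_2^T \omega = d
  &\quad\Rightarrow\quad \lambda = -2\,(x_2^T R^{-1} x_2)^{-1} d \\
\omega &= R^{-1} x_2\,(x_2^T R^{-1} x_2)^{-1} d
\end{align*}
```

The third line uses the constraint of equation 45 for the linear mapping, $\omega^T x_2 = d^T$, written in transposed form.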

Depending on the characterization of $X_1$, equation 49 describes various SDF-type filters
(i.e. MACE, MVSDF, etc.). In the case of the MACE filter, the rejection class is characterized
by all 2-D circular shifts of target class images away from the origin. Solving for
the MACE filter coefficients is therefore equivalent to using the average circular autocorrelation
sequence (or equivalently the average power spectrum in the frequency domain)
over images in the target class as estimators of the elements of the matrix $E(X_1 X_1^T)$.
Sudharsanan et al. [1991] suggest a very similar methodology for improving the performance
of the MACE filter. In that case the average linear autocorrelation sequence is estimated
over the target class and this estimator of $E(X_1 X_1^T)$ is used to solve for linear
projection coefficients in the space domain. The resulting filter is referred to as the
SMACE (space-domain MACE) filter.

4.4.2 Nonlinear Mappings

For arbitrary nonlinear mappings it will, in general, be very difficult to solve for globally
optimal parameters analytically. Our purpose is instead to develop iterative training
algorithms which are practical and yield improved performance over the linear mappings.
It is through the implicit description of the rejection class by its second-order statistics
that we have developed an efficient method for extending the MACE filter and other
related correlators to nonlinear topologies such as neural networks.

As  stated,  our  goal  is  to  find  mappings,  defined  by  a  topology  and  a  parameter  set, 
which  improve  upon  the  performance  of  the  MACE  filter  in  terms  of  generalization  while 
maintaining  a  sharp  constrained  peak  in  the  center  of  the  output  plane  for  images  in  the 
recognition  class.  One  approach,  which  leads  to  an  iterative  algorithm,  is  to  approximate 
the  original  objective  function  of  equation  46  with  the  modified  objective  function 

$$J = (1-\beta)\,E(g(\omega, X_1)^2) + \beta\,[g(\omega, x_2) - d^T][g(\omega, x_2) - d^T]^T. \qquad (51)$$

The principal advantage gained by using equation 51 over equation 46 is that we can
solve iteratively for the parameters of the mapping function (assuming it is differentiable)
using gradient search. The constraint equations, however, are no longer satisfied with
equality over the training set. It has been recognized that the choice of constraint values
has a direct impact on the performance of optimized linear correlators. Sudharsanan et al.
[1990] have explored techniques for optimally assigning these values within the constraints
of a linear topology. Other methods have been suggested [Mahalanobis et al.,
1994a, 1994b; Kumar and Mahalanobis, 1995] to improve the performance of distortion
invariant filters by relaxing the equality constraints. Mahalanobis et al. [1994a] extend this
idea to unconstrained linear correlation filters. The OTSDF objective function of
Refregier and Figue [1991] appears similar to the modified objective function and indeed, for a linear
topology this can be solved analytically as an optimal trade-off problem.
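To make the role of gradient search concrete, the sketch below evaluates a sample estimate of the objective in equation 51 and minimizes it by plain gradient descent for a linear mapping. It is an illustration under stated assumptions, not the training procedure of this work: the random stand-in data, the step size, the value of `beta`, and all function names are hypothetical.

```python
import numpy as np

def modified_objective(g, omega, x1_samples, x2, d, beta):
    """Sample estimate of equation 51 for a mapping g applied column-wise."""
    rejection_term = np.mean(g(omega, x1_samples) ** 2)   # estimates E(g(omega, X_1)^2)
    err = g(omega, x2) - d                                 # deviation from desired outputs
    return (1.0 - beta) * rejection_term + beta * float(err @ err)

def linear_map(omega, X):
    """g(omega, X) = omega^T X, one output per column of X."""
    return omega @ X

def linear_gradient(omega, x1_samples, x2, d, beta):
    """Gradient of the sample objective with respect to omega for the linear map."""
    n = x1_samples.shape[1]
    g1 = linear_map(omega, x1_samples)
    err = linear_map(omega, x2) - d
    return (1.0 - beta) * (2.0 / n) * (x1_samples @ g1) + beta * 2.0 * (x2 @ err)

rng = np.random.default_rng(0)
dim, n_t, n_rej = 16, 4, 200
x2 = rng.standard_normal((dim, n_t))      # stand-in recognition exemplars (assumed data)
x1 = rng.standard_normal((dim, n_rej))    # stand-in rejection-class samples (assumed data)
d = np.ones(n_t)
omega = np.zeros(dim)
beta, step = 0.5, 0.01

history = []
for _ in range(200):
    history.append(modified_objective(linear_map, omega, x1, x2, d, beta))
    omega -= step * linear_gradient(omega, x1, x2, d, beta)
```

For a nonlinear mapping only `linear_map` and its gradient would change; the recorded `history` shows the objective decreasing while the constraints are approached but not met with equality.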



Our primary purpose for modifying the objective function is to allow for an iterative
method within the NL-MACE architecture. We have already shown in the previous chapter
that this choice of criterion is suitable for classification. We will show that the primary
qualities of the MACE filter are still maintained when we relax the equality constraints in
our formulation. Varying $\beta$ in the range [0, 1] controls the degree to which the average
response to the rejection class is emphasized versus the variance about the desired output
over the recognition class.

As in the linear case, we can only estimate the expected variance of the output due to
the random vector input and its associated gradient. If, as in the MACE (or SMACE) filter
formulation, $X_1$ is characterized by all 2-D circular (or linear) shifts of the recognition
class away from the origin, then this term can be estimated with a sampled average over
the exemplars, $x_2$, for all such shifts. From an iterative standpoint this still leads to the
impractical approach of training exhaustively over the entire output plane. It is desirable,
then, to find other equivalent characterizations of the rejection class which may alleviate
the computational load without significantly impacting performance.

4.5 Efficient Representation of the Rejection Class
Training becomes an issue once the associative memory structure takes a nonlinear
form. The output variance of the linear MACE filter is minimized for the entire output
plane over the training exemplars. Even when the coefficients of the MACE filter are
computed iteratively we need only consider the output point at the designated peak location
(constraint) for each pre-whitened training exemplar [Fisher and Principe, 1994]. This
is due to the fact that for the under-determined case, the linear projection which satisfies
the system of constraints with equality and has minimum norm is also the linear projection
which minimizes the response to images with a flat power spectrum. This solution is
arrived at naturally via a gradient search which only considers the response at the constraint
location.

This  is  no  longer  the  case  when  the  mapping  is  nonlinear.  Adapting  the  parameters  via 
gradient  search  (such  as  error  backpropagation)  on  recognition  class  exemplars  only  at  the 
constraint  location  will  not,  in  general,  minimize  the  variance  over  the  entire  output  image 
plane.  In  order  to  minimize  the  variance  over  the  entire  output  plane  we  must  consider  the 
response  of  the  filter  to  each  location  in  the  input  image,  not  just  the  constraint  location. 

The  MACE  filter  optimization  criterion  minimizes,  in  the  average,  the  response  to  all 
images  with  the  same  second  order  statistics  as  the  rejection  class.  At  the  output  of  the  pre- 
whitener  (prior  to  the  MLP)  any  white  sequence  will  have  the  same  second  order  statistics 
as  the  rejection  class.  This  condition  can  be  exploited  to  make  the  training  of  the  MLP 
more  efficient. 

From an implementation standpoint, the pre-whitening stage and the input layer
weights can be combined into a single equivalent linear transformation. However,
pre-whitening separately allows the rejection class to be represented by white sequences at the
input to the MLP during the training phase.

This result is due to the statistical formulation of the optimization criterion. Minimizing
the response to white sequences, in the average, minimizes the response to shifts of the
exemplar images since they have the same second-order statistics (after pre-whitening).
Consequently, we do not have to train over the entire output plane exhaustively, thereby
reducing training time in proportion to the input image size, $N_1 N_2$. Instead, we use a
small number of randomly generated white sequences to efficiently represent the rejection
class. The result is an algorithm which is of order $N_t + N_{ns}$ (where $N_{ns}$ is the number of
white noise rejection class exemplars), as compared to exhaustive training over all
$N_1 N_2$ shifts of each of the $N_t$ exemplars.
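The idea that a modest number of random white sequences can stand in for the rejection class can be checked numerically for a case with a known answer. In the sketch below the Gaussian distribution of the white sequences, the sample count, and the helper name `rejection_energy_estimate` are assumptions; for a linear mapping with weight vector `w` and unit-variance white input, the expected squared output is the squared norm of `w`.

```python
import numpy as np

def rejection_energy_estimate(g, n_inputs, n_ns, rng):
    """Monte Carlo estimate of E(g(X_1)^2) using n_ns unit-variance white sequences."""
    noise = rng.standard_normal((n_inputs, n_ns))   # each column is one white sequence
    return float(np.mean(g(noise) ** 2))

rng = np.random.default_rng(0)
n_inputs = 64
w = rng.standard_normal(n_inputs)

def linear(X):
    """Linear mapping used only because its expected output energy is known."""
    return w @ X

exact = float(w @ w)       # E((w^T n)^2) = ||w||^2 for unit-variance white n
estimate = rejection_energy_estimate(linear, n_inputs, 20000, rng)
```

The same estimator applies unchanged when `g` is a nonlinear mapping such as the MLP, where no closed form is available; only the number of white sequences needs to be chosen.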

4.6 Experimental Results

We now present experimental results which illustrate the technique and potential pitfalls.
There are four significant outcomes in the experiments presented in this section. The
first is that when using the white sequences to characterize the rejection class, the linear
solution is a strong attractor. The second outcome is that imposing orthogonality on the
input layer to the MLP tends to lead to a nonlinear solution with improved performance.
The third result, in which we restrict the rejection class to a subspace, yields a significant
decrease in the convergence time. The fourth result, in which we borrow from the idea of
using the interior of the convex hull to represent the rejection class [Kumar et al., 1994],
yields significantly better classification performance.

In these experiments we use the data depicted in figure 23. As in the previous experiments,
images from vehicle 1a will be used as the training set. Vehicles 1b and 1c will be
used as the recognition class while vehicles 2a and 2b will be used as a rejection/confusion
class for testing purposes. In each case comparisons will be made to a baseline linear filter.
Specifically, in all cases the value of $\alpha$ for the linear filter is set to 0.99. The aspect
separation between training images is 2.0 degrees. This results in 41 training exemplars
from vehicle 1a. These settings of $\alpha$ and aspect separation were found to give the best
classifier performance for the linear filter with this data set. We continue to refer to this as
a MACE filter since the MACE criterion is so heavily emphasized. Technically it is an
OTSDF filter, but such nomenclature does not convey the type of pre-processing that is
being performed. We choose the value of $\alpha$ so as to compare to the best possible MACE
filter for this data set.

The nonlinear filter will use the same pre-processor as the linear filter (i.e. $\alpha = 0.99$).
The MLP structure is shown at the bottom of figure 22. It accepts an $N_1 N_2$ input vector (a
pre-processed image reordered into a column vector), followed by two hidden layers (with
two and three hidden PE nodes, respectively), and a single output node. The parameters of
the MLP are to be determined through gradient search. The gradient search technique used
in all cases will be the error backpropagation algorithm.
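The topology just described can be sketched as a forward pass in NumPy. The layer sizes follow the text; the tanh nonlinearity, the presence of bias terms, applying the nonlinearity at the output node, and the random initialization are assumptions made here, since the text does not specify them.

```python
import numpy as np

def init_mlp(n_inputs, rng, hidden=(2, 3)):
    """Random weights and zero biases for layers n_inputs -> 2 -> 3 -> 1."""
    sizes = (n_inputs, *hidden, 1)
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Map pre-processed image vectors (columns of x) to one scalar output each."""
    a = x
    for W, b in params:
        a = np.tanh(W @ a + b[:, None])      # tanh is an assumed nonlinearity
    return a[0]

rng = np.random.default_rng(0)
params = init_mlp(64 * 64, rng)               # 64 x 64 images reordered into vectors
outputs = mlp_forward(params, rng.standard_normal((64 * 64, 5)))
```

With 64 x 64 inputs almost all of the parameters sit in the first layer, which is a 2 x 4096 matrix acting on the pre-whitened image.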

4.6.1 Experiment I - noise training

As stated, using the statistical approach, the rejection class is characterized by white
noise sequences at the input to the MLP. The recognition class is characterized by the
exemplars. It is from these white noise sequences that the MLP, through the backpropagation
learning algorithm, captures information about the rejection class. So it would seem a
simple matter, during the training stage, to present random white noise sequences as the
rejection class exemplars. This is exactly the training method used for this experiment.
We observed empirically that with this method of training the linear
solution is a strong attractor. The results of the first experiment demonstrate this behavior.


Figure 28 shows the peak output response taken over all images of vehicle 1a for both
the linear (top) and nonlinear (bottom) filters. In the figure we see that for the linear filter
the peak constraint (unity) is met exactly for the training exemplars with degradation for
the between-aspect exemplars. As mentioned previously, if the pure MACE filter criterion
were used ($\alpha$ equal to unity), the peak in the output plane is guaranteed to be at the constraint
location [Mahalanobis et al., 1987]. It turns out that for this data set the peak output
also occurs at the constraint location for the training images; with $\alpha = 0.99$, however, it was
not guaranteed. Examination of the peak output response for the NL-MACE filter shows
that the constraints are met very closely (but not exactly) for the training exemplars, also
with degradation in the peak output response at between-aspect locations. The degradation
for the nonlinear filter is noticeably less than in the linear case and so in this regard it has
outperformed the linear filter.

Figure 29 shows the output plane response for a single image of vehicle 1a (not one
used for computing the filter coefficients) for the linear filter (top) and the nonlinear filter
(bottom). Again we see that both filters produce a noticeable peak when the image is
centered on the filter and a reduced response when the image is shifted. The reduction in
response to the shifted image is again noticeably better in the nonlinear filter than in the
linear filter. The same would be found to be true for all images of vehicle 1a, and so in this
regard we can again say that the nonlinear filter has outperformed the linear filter.
However, as we have already illustrated for the linear case, these measures alone are not
sufficient to predict classifier performance and are certainly not sufficient to compare
linear systems to nonlinear systems. This point is made clear in table 3, which summarizes
the classifier performance at two probabilities of detection for all of the experiments
reported here when vehicles 1b and 1c are used as the recognition class and vehicles 2a
and 2b are used for the rejection class. At this point we are only interested in the results
pertaining to the linear filter (our baseline) and the nonlinear filter results for experiment I.

[Figure 28 plot: two panels, "linear filter" (top) and "nonlinear filter" (bottom); horizontal axis: aspect (degrees).]

Figure 28. Peak output response of linear and nonlinear filters over the training
set. The nonlinear filter clearly outperforms the linear filter by this
metric alone.




Figure 29. Output response of linear filter (top) and nonlinear filter (bottom).
The response is for a single image from the training set, but not one
used to compute the filter.

This table shows that the classifier performance for the linear and nonlinear filters is
nominally the same, despite what may be perceived to be better performance in the
nonlinear filter with regard to peak response over the training vehicle and reduced output
plane response to shifts of the image. Furthermore, if we examine figure 30, which shows
the ROC curves for both filters, we see that they overlay each other. From a classification
standpoint the two filters are equivalent.
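The comparison in table 3 reads off the false alarm probability at a fixed detection probability. A minimal sketch of that computation follows, assuming each test image has already been reduced to a scalar score (for example the peak output response). The function name `pfa_at_pd` and the synthetic Gaussian scores are illustrative assumptions, not the data or code used in these experiments.

```python
import numpy as np

def pfa_at_pd(rec_scores, rej_scores, pd):
    """Return the false alarm rate at the threshold that detects a fraction
    `pd` of the recognition class scores (higher score = more target-like)."""
    rec = np.sort(np.asarray(rec_scores, dtype=float))[::-1]   # descending order
    rej = np.asarray(rej_scores, dtype=float)
    # Number of recognition exemplars that must reach the threshold;
    # the small offset guards against floating-point round-up in pd * size.
    k = max(1, int(np.ceil(pd * rec.size - 1e-12)))
    threshold = rec[k - 1]                  # lowest score that is still detected
    return float(np.mean(rej >= threshold))

# Synthetic scores, for illustration only.
rng = np.random.default_rng(0)
rec_scores = rng.normal(1.0, 0.2, size=1000)    # hypothetical recognition class scores
rej_scores = rng.normal(0.5, 0.2, size=1000)    # hypothetical rejection class scores
for pd in (0.80, 0.99):
    print(f"Pd = {pd:.2f}  ->  Pfa = {pfa_at_pd(rec_scores, rej_scores, pd):.4f}")
```

Two classifiers whose score distributions differ only by a monotonic transformation give identical values here, which is one way two filters can look different in the output plane yet be equivalent as tests.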


Figure  30.  ROC  curves  for  linear  filter  (solid  line)  versus  nonlinear  filter  (dashed 
line).  Despite  improved  performance  of  the  nonlinear  filter  as 
measured  by  peak  output  response  and  reduced  variance  over  the 
training  set,  the  filters  are  equivalent  with  regards  to  classification 
over  the  testing  set. 


This result is best explained by figure 31. Recall the points $u_1$ and $u_2$ labeled in
figure 22.

We can view these outputs as a feature space; that is, the MLP discriminant function
can be superimposed on the projection of the input image onto this space. In this case the
feature space is a representation of the input vector internal to the MLP structure. The
designation of these points as features is due to the fact that they represent some abstract
quality of the data and the decision surface can be computed as a function of the features.
Mathematically this can be written

$$u = W_1^T x, \qquad y = \sigma\left(W_3^T \sigma\left(W_2^T \sigma(u) + \varphi\right)\right). \tag{52}$$

Recall that the matrix $W_i$ represents the connectivities from the output of layer $(i-1)$ to
the inputs of the PEs of layer $i$, $\varphi$ is a constant bias term, and $\sigma(\cdot)$ is a sigmoidal
nonlinearity (the hyperbolic tangent function in this case).

[Figure 31 plot: feature space for the training set (top) and the testing set (bottom); both axes span -1.0 to 1.0; legend entries include "recognition, testing" and "rejection, testing".]

Figure 31. Experiment I: Resulting feature space from simple noise training.
Note that all points are projected onto a single curve in the feature
space. In the top figure squares are the recognition class training
exemplars, triangles are white noise rejection class exemplars, and
plus signs are the images of vehicle 1a not used for training. In the
bottom figure, squares are the peak responses from vehicles 1b and 1c,
triangles are the peak responses from vehicles 2a and 2b.

Figure 31 shows this projection for the training set (top) and the testing set (bottom).
What is significant in the figure is that although the discriminant as a function of the
vector $u$ is nonlinear, the projections of the images lie on a single curve in this feature
space. Topologically this filter can be put into one-to-one correspondence with a linear
projection. This is not to say that the linear solution is undesirable, but under the
optimization criterion it can be computed in closed form. Furthermore, in a space as rich
as the ISAR image space it is unlikely that the linear solution will give the best
classification performance.
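To superimpose a discriminant on a two-dimensional feature space, the network downstream of the features can be evaluated on a grid of $(u_1, u_2)$ values. The sketch below assumes the discriminant has the form $\sigma(W_3^T \sigma(W_2^T \sigma(u) + \varphi))$ with $\sigma = \tanh$. The 2-3-1 layer sizes, the random weights, and the function name `discriminant` are hypothetical placeholders rather than the trained network.

```python
import numpy as np

def discriminant(u, W2, W3, phi):
    """Evaluate tanh(W3^T tanh(W2^T tanh(u) + phi)).

    u   : (..., 2) array of feature vectors (u1, u2)
    W2  : (2, H) weights from the feature layer to H hidden PEs
    W3  : (H, 1) weights from the hidden PEs to the output
    phi : (H,) constant bias term
    """
    hidden = np.tanh(np.tanh(u) @ W2 + phi)   # row-vector form of W2^T sigma(u) + phi
    return np.tanh(hidden @ W3)[..., 0]

rng = np.random.default_rng(0)
W2 = rng.standard_normal((2, 3))    # hypothetical weights, not trained values
W3 = rng.standard_normal((3, 1))
phi = rng.standard_normal(3)

# Grid over the feature space, matching the [-1, 1] range of the feature-space plots.
u1, u2 = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
grid = np.stack([u1, u2], axis=-1)      # shape (5, 5, 2)
y = discriminant(grid, W2, W3, phi)     # shape (5, 5), one output per grid point
print(y.shape)
```

Contour lines of `y` over the grid give the kind of decision-surface overlay described in the text; the projected images would then be plotted as points in the same plane.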

Table 3. Comparison of ROC classifier performance for two values of Pd. Results are shown for the linear
filter versus four different types of nonlinear training. N: white noise training, G-S: Gram-Schmidt
orthogonalization, subN: PCA subspace noise, C-H: convex hull rejection class.

                                        Pfa (%)
          linear     nonlinear filter, experiments I-IV
Pd (%)    filter     I (N)     II (N, G-S)     III (subN, G-S)     IV (subN, G-S, C-H)
80          4.37      4.37        3.74              2.81                  2.45
99         42.43     41.87       27.15             26.52                 15.33


4.6.2  Experiment II - noise training with an orthogonalization constraint

As a means of avoiding the linear solution a modification was made to the training
algorithm. The modification was to impose orthogonality on the columns of $W_1$ through a
Gram-Schmidt process. The motivation for doing this stems from the fact that we are
working in a pre-whitened image space. In a pre-whitened image space this condition is
sufficient to assure that the outputs in the feature space, as measured at $u_1$ and $u_2$, will be
uncorrelated over the rejection class. Mathematically this can be shown as

$$
\begin{aligned}
E\{u u^T\} &= E\{W_1^T x_1 x_1^T W_1\} \\
&= W_1^T E\{x_1 x_1^T\} W_1 \\
&= \begin{bmatrix} w_1^T E(x_1 x_1^T) w_1 & w_1^T E(x_1 x_1^T) w_2 \\ w_2^T E(x_1 x_1^T) w_1 & w_2^T E(x_1 x_1^T) w_2 \end{bmatrix} \\
&= \begin{bmatrix} w_1^T \sigma^2 I w_1 & w_1^T \sigma^2 I w_2 \\ w_2^T \sigma^2 I w_1 & w_2^T \sigma^2 I w_2 \end{bmatrix} \\
&= \begin{bmatrix} \sigma^2 \|w_1\|^2 & 0 \\ 0 & \sigma^2 \|w_2\|^2 \end{bmatrix}
\end{aligned}
$$

where $w_1, w_2 \in \Re^{N_1 N_2 \times 1}$ are the columns of $W_1$. This result is true for any number of
nodes in the first layer of the MLP.
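The effect of the orthogonality constraint can be checked numerically. The following is a minimal sketch, assuming white rejection class exemplars with covariance $\sigma^2 I$ and a modified Gram-Schmidt step that also normalizes the columns (the normalization is an assumption; the argument above needs only orthogonality). The sizes, the value of `sigma`, and the function name `gram_schmidt` are hypothetical.

```python
import numpy as np

def gram_schmidt(W):
    """Return a matrix with orthonormal columns spanning the columns of W
    (modified Gram-Schmidt; assumes the columns are linearly independent)."""
    Q = np.array(W, dtype=float)
    for j in range(Q.shape[1]):
        for i in range(j):
            Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]   # remove component along column i
        Q[:, j] /= np.linalg.norm(Q[:, j])
    return Q

rng = np.random.default_rng(0)
n_pixels, n_features, n_samples = 64, 2, 20000   # hypothetical sizes
sigma = 0.5                                      # hypothetical noise standard deviation

W1 = gram_schmidt(rng.standard_normal((n_pixels, n_features)))
X = sigma * rng.standard_normal((n_pixels, n_samples))   # white rejection class exemplars
U_feat = W1.T @ X                       # features u = W1^T x, one column per exemplar
C = U_feat @ U_feat.T / n_samples       # sample estimate of E{u u^T}
print(np.round(C, 3))                   # approximately diagonal
```

In a training loop the same step would be reapplied to the first-layer weights after each update so that the columns stay orthogonal as they adapt.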

The results of the training with this modification are shown in figure 32, which is the
resulting feature space as measured at $u_1$ and $u_2$. From this figure we can see that the
discriminant function, represented by the contour lines, is a nonlinear function of $u_1$ and
$u_2$. Furthermore, because the projections of the vehicles into the feature space do not lie
on a single curve (as in the previous experiment), the features represent different
discrimination information with regard to both the rejection and recognition classes. The
bottom of the figure, showing the projection of a random sampling of the test vehicles (all
1282 would be too dense for plotting), shows that both features are useful for separating
vehicle 1 from vehicle 2. Examination of table 3 (column II in the nonlinear results) shows
that improved false alarm performance has been obtained at the two detection
probabilities of interest. Figure 33 shows the ROC curve for the resulting filter. It is
evident that the nonlinear filter is a uniformly better test for classification.


[Figure 32 plot: feature space for the training set (top) and the testing set (bottom); both axes span -1.0 to 1.0. Top legend: recognition, training; recognition, nontraining; rejection, training. Bottom legend: recognition, testing; rejection, testing.]

Figure 32. Experiment II: Resulting feature space when orthogonality is imposed
on the input layer of the MLP. In the top figure squares indicate the
recognition class training exemplars, triangles indicate white noise
rejection class exemplars, and plus signs are the images of vehicle 1a
not used for training. In the bottom figure, squares are the peak
responses from vehicles 1b and 1c, triangles are the peak responses
from vehicles 2a and 2b.




Figure 33. Experiment II: Resulting ROC curve with orthogonality constraint.

Convinced that the filter represents a better test for classification than the linear filter,
we now examine the result for the other features of interest. Figure 34 shows the output
response of this filter for one of the images. As seen in the figure, a noticeable peak at the
center of the output plane has been achieved. This shows that the filter maintains the
localization properties of the linear filter.

In this way the characterization of the rejection class by its second order statistics, the
addition of the orthogonality constraint at the input layer of the MLP, and the use of a
nonlinear topology have resulted in a superior classification test.


4.6.3  Experiment III - subspace noise training

The next experiment describes an additional modification to this technique. One of
the issues in training nonlinear systems is the convergence time. Training methods which
require overly long training times are not of much practical use. We have already shown
how to reduce the training complexity by recognizing that we can sufficiently describe the
rejection class with white noise sequences. We now show a more compact description of
the rejection class which leads to shorter convergence times, as demonstrated empirically.
This description relies on the well known singular value decomposition (SVD).

[Figure 34 plot: output plane response; vertical axis spans 0.0 to 1.0.]

Figure 34. Experiment II: Output response to an image from the recognition class
training set.

We view the random white sequences as stochastic probes of the performance surface
in the whitened image space. The classifier discriminant function is, of course, not
determined by the rejection class alone. It is also affected by the recognition class. We have
shown previously that the white noise sequences enable us to probe the input space more
efficiently than examining all shifts of the recognition exemplars. However, we are still
searching a space of dimension equal to the image size, $N_1 N_2$.

One of the underlying premises of a data driven approach is that the information about
a class is conveyed through exemplars. In this case the recognition class is represented by
$N_t < N_1 N_2$ exemplars placed in the data matrix $X_2 \in \Re^{N_1 N_2 \times N_t}$. It is well known that
$X_2$, if it is full rank, can be decomposed with the SVD as

$$X_2 = U \Lambda V^T \tag{53}$$

where the columns of $U \in \Re^{N_1 N_2 \times N_t}$ are an orthonormal basis that spans the column
space of the data matrix, $\Lambda$ contains the singular values, and $V$ is an orthogonal matrix.
This decomposition has many well known properties including compactness of
representation for the columns of the data matrix [Gerbrands, 1981]. Indeed, as has been
noted by Gheen [1990], the SDF can be written as a function of the SVD of the data matrix:

$$h_{SDF} = U \Lambda^{-1} V^T d \tag{54}$$

We will use this recognition class representation to further refine our description of the
rejection class for training. As we stated, the underlying assumption in a data driven
method is that the data matrix $X_2$ conveys information about the recognition class; any
information about the recognition class outside the space of the data matrix is not
attainable from this perspective. The information certainly exists, but there is no
mechanism by which to include it in the determination of the discriminant function within
this framework. This does, however, lead to a more efficient description of the rejection
class. We can modify our optimization criterion to reduce the response to white sequences
as they are projected into the $N_t$-dimensional subspace of the data matrix. Effectively this
reduces the search for a discriminant function in an $N_1 N_2$-dimensional space to a search
in an $N_t$-dimensional subspace.
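The relationship between the SVD of the data matrix and the SDF can be verified numerically. The sketch below assumes the standard closed form $h = X_2 (X_2^T X_2)^{-1} d$ for the SDF and compares it with $U \Lambda^{-1} V^T d$. The matrix sizes and the random stand-in for the data matrix are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_train = 64, 5                        # hypothetical N1*N2 = 64 and Nt = 5
X2 = rng.standard_normal((n_pixels, n_train))    # stand-in for the recognition class data matrix
d = np.ones(n_train)                             # unity peak constraints

# Thin SVD: X2 = U diag(s) Vt, with U of shape (n_pixels, n_train).
U, s, Vt = np.linalg.svd(X2, full_matrices=False)

h_svd = U @ ((Vt @ d) / s)                       # U Lambda^{-1} V^T d
h_direct = X2 @ np.linalg.solve(X2.T @ X2, d)    # X2 (X2^T X2)^{-1} d

# Orthogonal projector onto the column space of the data matrix.
P = U @ U.T

print(np.max(np.abs(h_svd - h_direct)))          # difference between the two forms
print(X2.T @ h_svd)                              # filter response to each training exemplar
```

The projector `P` is what restricts a white noise vector to the subspace of the data matrix; the filter itself already lies in that subspace because it is a combination of the columns of `U`.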



The adaptation scheme of backpropagation allows a simple mechanism to implement
this constraint. The adaptation of matrix $W_1$ at iteration $k$ can be written as

$$W_1(k+1) = W_1(k) + x(k)\,\varepsilon_1^T(k) \tag{55}$$

where $\varepsilon_1$ is a column vector derived from the backpropagated error and $x(k)$ is the
current input exemplar from either class presented to the network which, by design, lies in
the subspace spanned by the columns of $U$. From equation (55), if the rejection class noise
exemplars are restricted to lie in the data space of $X_2$, which can be achieved by projecting
random vectors of size $N_t$ onto the matrix $U$ above, and $W_1$ is initialized to be a random
projection from this space, we are assured that the columns of $W_1$ only extract
information from the data space of $X_2$. This is because the columns of $W_1$ will only be
constructed from vectors which lie in the column space of $U$ and so will be orthogonal to
any vector component that lies in the null space of $U^T$.

The search for a discriminant function is now reduced from a search within an
$N_1 N_2$-dimensional space to a search within an $N_t$-dimensional space. Due to the
dimensionality reduction achieved we would expect the convergence time to be reduced.
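The invariance argument can be illustrated with a short simulation: if the first-layer weights start in the column space of $U$ and every update is a rank-one term built from an input in that space, the weights never leave it. In this sketch the error vector is replaced by small random numbers, since only the structure of the update matters here; the sizes, names, and random data are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_train, n_features = 64, 5, 2         # hypothetical sizes
X2 = rng.standard_normal((n_pixels, n_train))    # stand-in recognition class data matrix
U, _, _ = np.linalg.svd(X2, full_matrices=False)

W1 = U @ rng.standard_normal((n_train, n_features))   # random initialization inside span(U)

for k in range(100):
    if k % 2 == 0:
        x = U @ rng.standard_normal(n_train)     # subspace noise exemplar, x = U n
    else:
        x = X2[:, rng.integers(n_train)]         # recognition class exemplar
    eps = 0.01 * rng.standard_normal(n_features) # placeholder for the backpropagated error
    W1 = W1 + np.outer(x, eps)                   # rank-one update x(k) eps^T(k)

residual = W1 - U @ (U.T @ W1)                   # part of W1 outside span(U)
z = rng.standard_normal(n_pixels)
z_perp = z - U @ (U.T @ z)                       # vector orthogonal to span(U)

print(np.linalg.norm(residual))                  # numerically negligible
print(W1.T @ z_perp)                             # features ignore components outside span(U)
```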

This is the method that was used for the third experiment. Rejection class noise
exemplars were generated by projecting a random vector, $n \in \Re^{N_t \times 1}$, onto the basis $U$ by
$x_{\mathrm{rej}} = Un$. In figure 35 the resulting discriminant function is shown as in the previous
experiments, and the result is similar to experiment II. The classifier performance as
measured in table 3 and the ROC curve of figure 36 are also nominally the same.


[Figure 35 plot: feature space for the training set (top) and the testing set (bottom); both axes span -1.0 to 1.0. Top legend: recognition, training; recognition, nontraining; rejection, training. Bottom legend: recognition, testing; rejection, testing.]


Figure  35.  Experiment  III:  Resulting  feature  space  when  the  subspace  noise  is 
used  for  training.  Symbols  represent  the  same  data  as  in  the  previous 
case. 


There are, however, two notable differences. Examination of figure 37 shows that the
output response to shifted images is even lower, allowing for better localization. This
condition was found to be the case throughout the data set. Of more significance is the
result shown in figure 38, in which we compare the learning curves of all of the
experiments presented here. In this figure the dashed and dashed-dot lines are the learning
curves for experiments II and III, respectively. In this case the convergence rate was
increased nominally by a factor of three, from 100 epochs to approximately 30 epochs.
Here an epoch represents one pass through all of the training data.

Figure 36. Experiment III: Resulting ROC curve for subspace noise training.


4.6.4  Experiment IV - convex hull approach

In this experiment we present a technique which borrows from the ideas of Kumar et
al. [1994]. That approach designed an SDF which rejects images that are away from the
boundary of the convex hull of the training set. The convex hull of a set
$\{x_1, x_2, \ldots, x_{N_t}\}$ is defined as all points which can be represented as

$$x = \sum_{i=1}^{N_t} a_i x_i$$

where the $a_i$'s are constrained to satisfy

$$a_i \ge 0, \qquad \sum_{i=1}^{N_t} a_i = 1.$$

It was pointed out by Kumar et al. that when the peak constraints for the SDF (or
any of the linear distortion invariant filters) are all set to unity, points in the interior of the
convex hull over the training exemplars are recognized as well as those near the extremal
points (by linearity, the response to any convex combination is $\sum_i a_i \cdot 1 = 1$). This would
include, for example, an image which is the mean of the training exemplars. Examination
of imagery derived from points that are closer to the interior of the convex hull, rather than
near the boundary, would indicate that they are not representative of the recognition class.

Figure 37. Experiment III: Output response to an image from the recognition
class training set.

[Figure 38 plot: "learning curve", logarithmic axes; horizontal axis: epoch.]

Figure 38. Learning curves for three methods. Experiment II: White noise
training (dashed line). Experiment III: subspace noise (dashed-dot
line). Experiment IV: subspace noise plus convex hull exemplars
(solid line).

It was suggested that a way to mitigate this property was to set the desired output over
the training set to be complex, with unity magnitude and zero mean. The magnitude of the
output was then used as the response. In this way only points near the boundary of the
convex hull are recognized.

The approach taken here is similar in that exemplars from the interior of the convex
hull are used as representative of the rejection class. The difference is that this description
is included in the learning process without determining the decision surface a priori (e.g.
magnitude of the correlator output). It is the nonlinear iterative process which determines
how to separate the recognition class exemplars from the images derived from the convex
hull. The result is significantly improved classification.

In this experiment we continue to use random noise projected onto the basis defined
by the columns of the matrix $U$ as in experiment III. In addition, convex hull exemplars are
generated by projecting a random vector $a \in \Re^{N_t \times 1}$ onto the data matrix $X_2$. The basis
for this approach is that elements of the convex hull that are distant from the extremal
points (the training exemplars) do not convey information about the recognition class, and
so in keeping with this idea we imposed a further restriction on the coefficients $a_i$, namely

$$\frac{0.9}{N_t} \le a_i \le \frac{1.1}{N_t}.$$

This restriction assures that none of the generated convex hull exemplars lie too close
to one of the recognition class training exemplars. Rejection class exemplars from within
the convex hull are randomly generated throughout the training from $x_{\mathrm{rej}} = X_2 a$. Another
property of these rejection class exemplars is that they also lie in the subspace of the data
matrix $X_2$.
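Generating such exemplars amounts to drawing coefficients that sum to one and stay close to the uniform weighting. The sketch below is one possible way to do this, assuming the restriction bounds each coefficient between $0.9/N_t$ and $1.1/N_t$ (an assumption about the exact form of the bound). The function name `convex_hull_exemplar`, the `spread` parameter, and the random stand-in data matrix are hypothetical.

```python
import numpy as np

def convex_hull_exemplar(X2, rng, spread=0.1):
    """Draw one point from the interior of the convex hull of the columns of X2.

    The coefficients satisfy sum(a) = 1 and
    (1 - spread) / Nt <= a_i <= (1 + spread) / Nt.
    """
    n_train = X2.shape[1]
    delta = rng.uniform(-spread, spread, n_train)
    delta -= delta.mean()                  # zero-sum perturbation keeps sum(a) = 1
    peak = np.max(np.abs(delta))
    if peak > spread:                      # centring can push entries past the bound
        delta *= spread / peak
    a = (1.0 + delta) / n_train
    return X2 @ a, a

rng = np.random.default_rng(0)
n_pixels, n_train = 64, 5                          # hypothetical sizes
X2 = rng.standard_normal((n_pixels, n_train))      # stand-in recognition class data matrix
x_rej, a = convex_hull_exemplar(X2, rng)

U, _, _ = np.linalg.svd(X2, full_matrices=False)
print(a, a.sum())
print(np.linalg.norm(x_rej - U @ (U.T @ x_rej)))   # exemplar lies in the data subspace
```

Because every coefficient is near $1/N_t$, each generated exemplar stays close to the mean of the training exemplars and away from any single one of them.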

Examination of table 3 and the ROC curve of figure 40 shows that this method yields
significantly improved classification performance. The discriminant function shown in
figure 39 is quite different and much more nonlinear than in the previous cases. In the
figure the convex hull exemplars are clustered between the subspace noise exemplars and
the recognition class exemplars. If this is a general property of the type of data we are
using then it may be a powerful method by which to describe the rejection class within the
nonlinear framework. More analysis is needed, however, before this conclusion can be
made. We do conclude that in this case this method is an effective means by which to
characterize the rejection class. The advantage of this technique over the linear method of
Kumar et al. [1994] is that the training learns automatically to separate the recognition
class exemplars from the convex hull exemplars, as opposed to assigning a complex
desired output for each exemplar a priori.

There were, however, some difficulties with this technique which are worth
mentioning. Recall that the motivation for using orthogonalization in the input layer was to
increase the likelihood that a nonlinear discriminant function was found. When using
convex hull exemplars in the rejection class, this may seem unnecessary. In practice,
however, it was found that when the orthogonalization was removed, training times
became extremely long. Even with orthogonalization we can see from the learning curve
(solid line) in figure 38 that convergence took over an order of magnitude longer than in
experiment III.

There were also stability issues with this type of training. The training became
unstable nearly as often as it converged. However, when the training did converge, as in
the results shown, the classification performance was always superior. Convergence, or its
lack, can be directly measured from the MSE. When convergence was not reached in a
suitable number of iterations (typically 1000 epochs), the algorithm was restarted with a
new random parameter initialization. Due to the improved classification results, we
believe that this method bears further study.


[Figure 39 plot: feature space for the training set (top) and the testing set (bottom), with the convex hull exemplars marked in the top panel. Top legend: recognition, training; recognition, nontraining; rejection, training. Bottom legend: recognition, testing; rejection, testing.]

Figure 39. Experiment IV: Resulting feature space from convex hull training. In
the figure symbols are as before. In the top figure one difference to
note is that the convex hull exemplars (indicated by arrow) are closer
to the discriminant boundary and play a greater role in determining the
shape of the function.


[Figure 40 plot: "ROC curve"; both axes span 0.0 to 1.0.]

Figure  40.  Experiment  IV:  Resulting  ROC  curve  with  convex  hull  approach. 


CHAPTER  5 
INFORMATION-THEORETIC  FEATURE  EXTRACTION 

5.1  Introduction

The material presented in this section is motivated by the analysis of the previous
chapter. Recall that beginning with section 4.6.1 our analysis of the nonlinear system was
aided by reference to a feature space within the MLP architecture. The designation of the
feature space led to useful modifications to the iterative training algorithm. In the
previous analysis, however, the generation of the features was a function of the training
algorithm with regard to the desired system response. The analysis did show, however, that
the representation of the data in the feature space was critical to the classification
performance. This section examines the de-coupling of the feature extraction stage from the
training of the discriminant function in the overall system architecture.

Of course, when the feature extraction is de-coupled it is important to use a criterion
which is related to the overall goal - classification. The approach described here uses an
information theoretic measure, namely mutual information, as a criterion for adaptation.
We will show that, although the feature extraction is de-coupled from the classifier
training, the resulting features are, in fact, specific to classification. This method represents a
new advance in the area of information theoretic signal processing and as such has wide
application beyond nonlinear extensions to MACE filters and classification.

We have recently presented a maximum entropy based technique for feature extraction
[Fisher and Principe, 1995c] which we now extend to mutual information. This method
differs from previous methods in that it is not limited to linear topologies [Linsker, 1988]
nor to uni-modal probability density functions (PDFs) [Bell and Sejnowski, 1995]. The
method is directly applicable to any nonlinear mapping which is differentiable in its
parameters. In particular, we demonstrate that the technique can be applied to a
feed-forward multi-layer perceptron (MLP) with an arbitrary number of hidden layers. It is
also shown that the resulting iterative training algorithm fits naturally into the
back-propagation method for training multi-layer perceptrons.

In this section we present some background information on feature extraction and
information theoretic approaches to signal processing. This is followed by the derivation
of the feature extraction method. Experimental results will be presented which illustrate
the usefulness of this approach. We will conclude with the logical placement of this
method within nonlinear MACE filters as well as experimental results which can be
directly compared to the results of section 4.6.

5.2  Motivation for Feature Extraction

We have shown in section 3.2 that, theoretically at least, the MSE criterion can be used
to iteratively train a universal approximator (such as the multi-layer perceptron) to classify
raw input variables directly. That is, the MSE criterion coupled with a universal
approximator estimates posterior class probabilities. An issue for any estimator is the variance.

There are two ways to reduce the variance of the estimate of the posterior probabilities
in this case: we can either supply more data (which may not be possible or practical) or we
can somehow impose constraints on the system. Feature extraction is a means by which
constraints can be imposed on the system [Fukanaga, 1990, Bishop, 1995]. This does not
contradict the results of the previous chapter. The previous chapter illustrated that moving
to a probabilistic framework coupled with a nonlinear topology led to improved
classification performance. Furthermore, the previous chapter addressed the lack of training
data through efficient descriptions of the rejection class by its second-order statistics and
the recognition class by a subspace. The discussion in this chapter seeks to extend this
approach beyond second-order descriptions by exploiting the underlying structure of both
classes.

It is imperative in any feature extraction algorithm that the criterion by which the
features are selected is somehow related to the overall system objective. Suitable criteria in
classification are not always easily employed as a means for classification (e.g. likelihood
ratios which require prior knowledge of the underlying probability density function).
Consequently, sub-optimal feature sets are used or, even more commonly, user defined ad hoc
features based on intuitive assumptions, but without any rigorous relationship to
classification.

It is often the case that projecting high-dimensional data onto a smaller subspace
results in improved performance of a nonparametric classifier. This statement is
counterintuitive as we cannot, in general, project onto a subspace without the loss of some
information. The results of the previous chapter, however, confirm this assertion. In the final
experiments, in our construction of rejection class exemplars, we implicitly restricted the
search space to two distinct subspaces. In the first case, we confined the rejection class to
the PCA subspace of the recognition class, while in the second case we restricted some of
the rejection class exemplars to the convex hull of the recognition class. In doing so, no
information concerning the recognition class was lost, but the intrinsic dimensionality of
the data was reduced to the number of recognition class exemplars, $N_t$.



We note that while these constraints led to improved classifier performance, the
subspaces used were more closely related to signal representation than to classification. It
should be noted, however, that for an $N$-class classification problem the optimal Bayes
classifier can be derived from an $(N-1)$-dimensional feature space, where the features
are the posterior probabilities of each class given the observation of the data (i.e.
$P(C \mid X = x)$). This point underlies a key difference between feature extraction for
classification versus feature extraction for signal representation [Fukanaga, 1990]. In
classification it is the number of classes, $N$, which determines the minimum feature-space
dimension.
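The reason $N-1$ posteriors suffice is that the posteriors sum to one, so the remaining one is determined by the others. The short sketch below illustrates this for a minimum-error decision with equal costs (an assumption made for the example); the function name `bayes_decision` and the numerical values are hypothetical.

```python
import numpy as np

def bayes_decision(partial_posteriors):
    """Pick the most probable of N classes given only the first N-1
    posterior probabilities P(C_i | X = x) along the last axis."""
    p = np.asarray(partial_posteriors, dtype=float)
    last = 1.0 - p.sum(axis=-1, keepdims=True)      # posteriors sum to one
    full = np.concatenate([p, last], axis=-1)       # all N posteriors
    return np.argmax(full, axis=-1)                 # minimum-error (equal cost) decision

# Three-class example: two features per observation determine the decision.
features = np.array([[0.2, 0.3],     # third class has posterior 0.5
                     [0.1, 0.7],     # second class has posterior 0.7
                     [0.6, 0.3]])    # first class has posterior 0.6
print(bayes_decision(features))
```

For the two-class recognition versus rejection problem considered in this work, this corresponds to a single scalar feature.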

The feature extraction approach to image classification is often decomposed,
as in figure 41, into two stages: feature extraction followed by discrimination. In some
cases, the decomposition is explicit while in others it is a matter of interpretation. Often
the features are determined in ad hoc fashion based on an intuitive understanding of the
data, but not explicitly with respect to classification. As an example, we can interpret the
linear distortion invariant filtering methods as a decomposition of a pre-whitening filter
(feature extraction) followed by an SDF (synthetic discriminant function). Similarly, the
NL-MACE architecture that we are working with can be decomposed in this way as shown
in figure 42. In fact, as the results have been reported, this is exactly the decomposition
that has been used for feature space analysis to this point. However, such a decomposition
is arbitrary, as the features extracted were driven by a single optimization criterion derived
from the output space and which is coupled to the training of the discriminator.




The  goal  of  feature  extraction  is  always  to  improve  the  overall  system  classification 
performance.  In  the  technique  we  are  presenting  now,  the  decomposition  is  explicit.  The 
feature extraction is decoupled from the determination of the discriminant function.


[Block diagram: a feature extraction stage followed by a discriminator stage.]

Figure 41. Classical pattern classification decomposition.


[Block diagram: input $x \in \Re^{N_1 \times N_2}$; pre-processor, $y = Ax$; feature
extraction stage.]

Figure 42. Decomposition of NL-MACE as a cascade of feature extraction followed by
discrimination.



5.3 Information Theoretic Background

At  this  point  we  provide  some  background  for  the  technique  we  are  using.  As  this 
material  is  more  specific  to  information  theoretic  processing  we  feel  that  it  is  more  appro- 
priately presented  at  this  time.  The  method  we  describe  here  combines  mutual  information 
maximization  with  Parzen  window  probability  density  function  estimation.  These  con- 
cepts are  reviewed. 

5.3.1 Mutual Information as a Self-Organizing Principle

Entropy  based  information  theoretic  methods  have  been  applied  to  a  host  of  problems 
(e.g.  blind  separation  [Bell  and  Sejnowski,  1995],  parameter  estimation  [Kapur  and  Kesa- 
van,  1992],  and,  of  course,  coding  theory  [Shannon,  1948],  etc.).  Linsker  [1988]  has  pro- 
posed mutual  information  (derived  from  entropy)  as  a  self-organizing  principle  for  neural 
systems.  The  premise  is  that  a  mapping  of  a  signal  through  a  neural  network  should  be 
accomplished  so  as  to  preserve  the  maximum  amount  of  mutual  information.  Linsker 
demonstrates  this  principle  of  maximum  information  preservation  for  several  problems 
including  a  deterministic  signal  corrupted  by  gaussian  noise. 

The appeal of mutual information as a criterion for feature extraction is threefold.
First, mutual information exploits the structure of the underlying probability density
function. Second, adaptation, as we will show, can be used to remove as much uncertainty
as possible about the input class using observations of the output, $y = g(x, \alpha)$.
Third, this is accomplished within the constraints of the mapping topology,
$g([\ ], \alpha)$.




Three  equivalent  formulations  of  mutual  information  are 


$$I(x,y) = h(x) + h(y) - h(x,y), \tag{56}$$

$$I(x,y) = h(y) - h(y|x), \text{ and} \tag{57}$$

$$I(x,y) = h(x) - h(x|y), \tag{58}$$


where $I(x,y)$ is the mutual information of the RVs $X$ and $Y$. In equations 56 through
58, $h(x)$ is the differential entropy measure (which we will refer to simply as entropy)
[Papoulis, 1991] of the RV $X$, $h(x|y)$ is the entropy of the RV $X$ conditioned on the
RV $Y$, and $h(x,y)$ is the joint entropy of the RVs $X$ and $Y$. Entropy is used to
quantify our uncertainty about a given random variable or vector. Mutual information
quantifies the relative uncertainty of one random variable/vector with respect to another;
it measures the information that one random variable/vector conveys about another. We note
that manipulation of mutual information is dependent upon the ability to manipulate
entropy. In fact, we can manipulate the entropy related terms of mutual information
independently.
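
The equivalence of the three formulations can be checked numerically in a case where every
entropy has a closed form. The sketch below assumes a pair of jointly gaussian scalar RVs
with an arbitrary, purely illustrative covariance matrix; the helper name
`gaussian_entropy` and the numbers are assumptions made for this example and are not part
of the method described here.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a gaussian RV with covariance `cov`."""
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * np.log((2.0 * np.pi * np.e) ** n * np.linalg.det(cov))

# Illustrative joint covariance of (X, Y); the values are arbitrary.
cov_xy = np.array([[1.0, 0.6],
                   [0.6, 2.0]])
var_x, var_y, cov_term = cov_xy[0, 0], cov_xy[1, 1], cov_xy[0, 1]

h_x = gaussian_entropy(var_x)
h_y = gaussian_entropy(var_y)
h_joint = gaussian_entropy(cov_xy)

# For jointly gaussian RVs the conditional distributions are gaussian with
# these variances, so the conditional entropies are computed independently
# of the joint entropy.
h_y_given_x = gaussian_entropy(var_y - cov_term ** 2 / var_x)
h_x_given_y = gaussian_entropy(var_x - cov_term ** 2 / var_y)

mi_56 = h_x + h_y - h_joint      # equation 56
mi_57 = h_y - h_y_given_x        # equation 57
mi_58 = h_x - h_x_given_y        # equation 58

# Known closed form for the gaussian case: -0.5 * log(1 - rho^2).
rho_sq = cov_term ** 2 / (var_x * var_y)
mi_closed_form = -0.5 * np.log(1.0 - rho_sq)

print(mi_56, mi_57, mi_58, mi_closed_form)
```

All four printed values should agree to within floating point precision.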

Following the notation of Papoulis, the entropy of a continuous random variable or vector
(RV), $X \in \Re^N$, is defined as

$$h(x) = -\int \log(f_X(x))\, f_X(x)\, dx, \tag{59}$$


where $f_X(x)$ is the probability density function of the RV, the base of the logarithm is
arbitrary, and the integral is $N$-fold. The conditional and joint forms of entropy used
in equations 56 through 58 substitute the conditional and joint probability density
functions, respectively, into equation 59. Inspection of equation 59 shows that entropy
can be seen as the negative of the expected value of the log of the probability density
function, or

$$h(x) = -E\{\log(f_X(x))\}. \tag{60}$$
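
The expected-value view of entropy suggests a simple sample-based check. The sketch below
is an illustration only: it assumes a scalar gaussian RV with an arbitrary standard
deviation, averages the negative log density over random draws, and compares the result
with the known closed-form gaussian entropy. The function name `gaussian_pdf` and the
sample size are assumptions of this example.

```python
import numpy as np

def gaussian_pdf(x, sigma):
    """Density of a zero-mean scalar gaussian with standard deviation sigma."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
sigma = 1.5                                  # arbitrary illustrative value
samples = rng.normal(0.0, sigma, size=200_000)

# Sample average of -log f_X(x): a Monte Carlo estimate of the entropy (nats).
h_monte_carlo = -np.mean(np.log(gaussian_pdf(samples, sigma)))

# Closed-form entropy of a scalar gaussian.
h_closed_form = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

print(h_monte_carlo, h_closed_form)
```

The two values should agree closely, with the gap shrinking as the number of samples grows.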

Several  properties  of  the  entropy  measure  are  of  interest. 

1. If the RV is restricted to a finite range in $\Re^N$, entropy is maximized for the
uniform distribution.

2. If the diagonal elements of the covariance matrix are held constant, entropy is
maximized for the normal distribution with diagonal covariance matrix.

3. If the RV is transformed by a mapping $g: \Re^N \to \Re^N$, then the entropy of the new
RV, $y = g(x)$, satisfies the inequality

$$h(y) \le h(x) + E\{\log(|J_{XY}|)\}, \tag{61}$$

with equality if and only if the mapping has a unique inverse, where $J_{XY}$ is the
Jacobian of the mapping from $X$ to $Y$.

Regarding the first two properties we note that in both instances each element of the RV
is statistically independent from the other elements. We will make use of the first
property in the method presented here.

Equation 61 implies that by transforming a RV we can increase the amount of information
that it conveys; that is, the RV $Y$ derived from the RV $X$ can have more information
than $X$. This is a consequence of working with continuous RVs. In general the continuous
entropy measure is used to compare the relative entropies of several RVs. We can see from
equation 61 that if two RVs are mapped by the same invertible linear transformation their
relative entropies (as measured by the difference) remain unchanged. However, if the
mapping is nonlinear, in which case the second term of equation 61 is a function of the
random variable, it is possible to change the relative information of two random
variables.
From the perspective of classification this is an important point. If the mapping is
topological (in which case it has a unique inverse), there is no increase, theoretically,
in the ability to separate classes. That is, we can always reflect a discriminant function
in the transformed space as a warping of another discriminant function in the original
space. This is not true, however, for a mapping onto a subspace. Our implicit assumption
here is that we are unable to reliably determine a discriminant function in the full input
space. As a consequence we seek a subspace mapping that is by some measure optimal for
classification. We cannot avoid the loss of information (and hence some ability to
discriminate classes) when using a subspace mapping. However, if the criterion used for
adapting the mapping is information (entropy) based, we can perhaps minimize this loss. It
should be mentioned that in all classification problems there is an implicit assumption
that the classes to be discriminated do indeed lie in a subspace.
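
The earlier remark that an invertible linear transformation leaves relative entropies
unchanged (equation 61) can be illustrated with gaussian RVs, for which the entropy is
available in closed form. In the sketch below the two covariance matrices and the matrix
`A` are arbitrary values chosen for illustration; under a linear map the Jacobian term of
equation 61 is the constant $\log|\det A|$, so both entropies shift by the same amount.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a gaussian with covariance `cov`."""
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + logdet)

# Two illustrative gaussian RVs (arbitrary positive definite covariances).
cov_1 = np.array([[1.0, 0.3], [0.3, 0.5]])
cov_2 = np.array([[2.0, -0.4], [-0.4, 1.5]])

# An arbitrary invertible linear map; a gaussian with covariance C is mapped
# to a gaussian with covariance A C A^T.
A = np.array([[2.0, 1.0], [0.5, 3.0]])

diff_before = gaussian_entropy(cov_1) - gaussian_entropy(cov_2)
diff_after = (gaussian_entropy(A @ cov_1 @ A.T)
              - gaussian_entropy(A @ cov_2 @ A.T))

# Each individual entropy shifts by log|det A|, the Jacobian term.
shift = gaussian_entropy(A @ cov_1 @ A.T) - gaussian_entropy(cov_1)
log_abs_det = np.log(abs(np.linalg.det(A)))

print(diff_before, diff_after, shift, log_abs_det)
```

The entropy difference is the same before and after the map, while each entropy taken
alone has changed.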

5.3.2 Mutual Information as a Criterion for Feature Extraction

It is our intent to use mutual information as a criterion for feature extraction (prior to
classification). The use of mutual information in this way can be motivated simply by
Fano's inequality [1961], which gives a lower bound for the probability of error (or
conversely an upper bound on the probability of correct classification) when estimating a
discrete RV from another RV as a function of the conditional entropy (and ultimately the
mutual information). Fano's inequality is stated as follows: given the discretely
distributed RV $X$ and a related RV $Y$, the probability of incorrectly estimating $X$
based on an estimate derived from observations of $Y$ is lower bounded by

$$P(\hat{X} \ne X) \ge \frac{h(x|y) - 1}{\log(N)}, \tag{62}$$

where $N$ is the number of discrete events or classes represented by the RV $X$ and
$\hat{X}$ is the estimate of $X$. Using equation 58 we can rewrite Fano's inequality as a
function of the mutual information of $X$ and $Y$ as follows

$$P(\hat{X} \ne X) \ge \frac{h(x) - I(x,y) - 1}{\log(N)}. \tag{63}$$

In this form of Fano's inequality we see that the lower bound on the error probability is
minimized when the mutual information between $X$ and $Y$ is maximized.
It can be shown that the upper bound on the probability of error is

$$P(\hat{X} \ne X) \le (1 - \max\{P_i\}) \le (1 - 1/N), \tag{64}$$

where $P_i$ is the prior probability of the $i$th class of $X$ and $\max\{P_i\}$ is the
maximum over the set of $P_i$'s. Equation 64 is itself upper bounded by $(1 - 1/N)$, the
case in which all classes are equally likely. This upper bound is met with equality when
the mutual information between $X$ and $Y$ is zero and the optimal class estimator reverts
to choosing the class with the greatest prior probability.
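
A small discrete example may help make the bounds of equations 62 through 64 concrete. The
sketch below is an illustration under stated assumptions: a four-class problem with an
arbitrary prior, a discrete observation $Y$ produced by a hypothetical noisy channel, and
entropies measured in bits (so that the constant 1 in Fano's inequality corresponds to a
base-2 logarithm). None of the numbers come from the experiments reported here.

```python
import numpy as np

def entropy_bits(p):
    """Discrete entropy in bits of a probability vector (zeros are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical class priors P_i and observation channel P(Y = y | X = x).
prior = np.array([0.4, 0.3, 0.2, 0.1])
n_classes = prior.size
channel = np.full((n_classes, n_classes), 0.1) + 0.6 * np.eye(n_classes)

joint = prior[:, None] * channel            # joint[x, y] = P(X = x, Y = y)
p_y = joint.sum(axis=0)

h_x = entropy_bits(prior)
h_x_given_y = entropy_bits(joint.ravel()) - entropy_bits(p_y)   # H(X,Y) - H(Y)
mutual_info = h_x - h_x_given_y

# Lower bounds on the error probability, equations 62 and 63.
fano_62 = (h_x_given_y - 1.0) / np.log2(n_classes)
fano_63 = (h_x - mutual_info - 1.0) / np.log2(n_classes)

# Error of the maximum a posteriori estimator, and the bound of equation 64.
bayes_error = 1.0 - joint.max(axis=0).sum()
upper_64 = 1.0 - prior.max()

print(fano_62, bayes_error, upper_64, 1.0 - 1.0 / n_classes)
```

The printed values should appear in nondecreasing order: the Fano lower bound, the error
of the best estimator, the prior-only bound, and finally $1 - 1/N$.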

This approach is depicted in figure 43 with regard to a Bayesian framework. In the figure,
$C$ is a discrete RV which represents the class. The function

$$P(C) = P_i, \quad i = 1 \ldots N, \tag{65}$$

is the probability density function of the class, where $P_i$ is the prior probability of
the $i$th class and $N$ is the number of classes. The probability density function of $X$
is conditioned on the class. The feature vector, $y = g(x, \alpha)$, is derived from the
observation of $X$ and is itself a random vector prior to observation. It is from the
feature vector, $y$, that we wish to estimate the class. Our goal is to choose the
parameters, $\alpha$, of the mapping $g([\ ], \alpha)$ such that the mutual information
between $Y$ and $C$ is maximized. We are still left with the task of determining the
estimator, $\hat{C}$; however, from Fano's inequality we know that if $I(c,y)$ is
maximized, the lower bound on the classification error will be minimized.


[Block diagram: a Bayesian source draws the class $c$ from $P(C)$ and the observation $x$
from $f(x|c)$; the mapping $g(x, \alpha)$ produces the features $y$, from which the
estimator $\hat{C}(y)$ produces the class estimate $\hat{c}$.]

Figure 43. Mutual information approach to feature extraction. An observation of the random
variable $X$ is generated by the probability density function $f(x|C)$, which is
conditioned on the discrete random variable $C$, which is characterized by the discrete
probability density function $P(C)$. The features, $y$, derived from the observation of
$X$, are used to estimate $C$.


5.3.3 Prior Work in Information Theoretic Neural Processing

The concept of using information theoretic measures in neural processing is not new. One
application, related to feature extraction, was for the purpose of generating ordered maps
[Linsker, 1990]. In this work, a modification to Kohonen's self-organizing feature map
(SOFM) [Kohonen, 1988, 1995], entropy is used as a competitive measure for adaptation.
Specifically, input exemplars are mapped onto a discrete lattice and entropy is used as a
measure for determining which lattice point to adapt. The method differs from the
presentation here in two ways:

1. the form of entropy used is discrete, whereas we are working with continuous entropy,
and

2. the mapping from the input to the output is constrained to be linear, whereas the
method presented here may be used with arbitrary nonlinear maps (so long as they are
differentiable).

Deco and Obradovic [1996] have also presented extensive results on information theoretic
approaches to neural processing. The techniques described differ from Linsker's method in
that they work with the continuous form of entropy and use nonlinear mappings. The
constraint, however, is that the mapping be symplectic (volume preserving) and bijective.
These constraints restrict the method to a subset of the mappings which are
$\Re^N \to \Re^N$. As we have stated, from a theoretical point of view, such mappings in
no way increase our ability to discriminate classes. Furthermore, it is our implicit
assumption that dimensionality reduction is one of the motivating factors for feature
extraction prior to classification. Deco and Obradovic also show that if the mapping
function is chosen to be linear in its parameters, very little can be done to manipulate
the information at the output of the mapping without prior knowledge of the input PDF.

Bell  and  Sejnowski  [1995]  present  yet  another  approach  to  information  theoretic  map- 
pings. Their  technique  is  applicable  to  subspace  projections.  It  is  limited  in  that  it  manipu- 
lates entropy  only  if  the  underlying  distribution  in  the  input  space  is  uni-modal. 
Furthermore,  it  is  restricted  to  nonlinear  MLP  architectures  of  a  single  layer.  The  method 
we  present  here  has  neither  of  these  restrictions. 

Viola et al. [1996] have taken a similar approach to the method presented here for entropy
manipulation. The work of Viola et al. differs in that it does not address arbitrary
nonlinear mappings directly, the gradient is estimated stochastically, and entropy is
manipulated explicitly. Their approach is similar to the approach in communications
theory, wherein the communications channel (or mapping) is assumed to be fixed. Mutual
information is then used to estimate the source of the observations. A significant
difference in the method presented here is that the mapping (or communications channel) is
not assumed to be fixed; rather, it is parameterized and we are free to choose the
parameters in order to manipulate entropy.

5.3.4 Nonparametric PDF Estimation

One  obstacle  to  using  mutual  information  as  the  figure  of  merit  is  that  it  is  an  integral 
function  of  the  PDF  of  a  continuous  random  variable.  Since  we  cannot  work  with  the  PDF 
directly  (unless  assumptions  are  made  about  its  form),  we  rely  on  nonparametric  esti- 
mates. Nonparametric  density  estimation  in  a  high-dimensional  space  is  an  ill-posed  prob- 
lem. The  approach  described  here,  however,  relies  on  such  estimates  in  the  output  space, 
as  depicted  in  figure  44,  where  the  dimensionality  is  under  the  control  of  the  designer. 


[Block diagram: the input passes through the mapping $g([\ ], \alpha)$ (feature
extraction) to produce the output. Information is observed in the low dimensional output
space and used to adapt the parameters of the mapping.]

Figure 44. Mapping as feature extraction. Information content is measured in the low
dimensional space of the observed output.


The Parzen window method [Parzen, 1962], which we will use, is a nonparametric
kernel-based method for estimating probability density functions. The Parzen window
estimate of the probability distribution, $\hat{f}_Y(u)$, of a random vector
$Y \in \Re^M$ at a point $u$ is defined as

$$\hat{f}_Y(u) = \left(\frac{1}{N_y}\right)\sum_{i=1}^{N_y} \kappa(y_i - u). \tag{66}$$

The vectors $y_i \in \Re^M$ are observations of the random vector and $\kappa([\ ])$ is a
kernel function which itself satisfies the properties of PDFs (i.e. $\kappa(u) \ge 0$ and
$\int \kappa(u)\,du = 1$). The Parzen window estimate can be viewed as a convolution of
the estimator kernel with the observations. Since we wish to make a local estimate of the
PDF, the kernel function should also be localized (i.e. uni-modal, decaying to zero). In
the method we describe we will also require that $\kappa([\ ])$ be differentiable
everywhere.
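
A minimal sketch of the estimator of equation 66 with an isotropic gaussian kernel is
given below. The function names, the kernel width, and the sample data are assumptions
made for illustration; the check at the end simply confirms that the estimate behaves like
a PDF (nonnegative, integrating to approximately one on a fine grid).

```python
import numpy as np

def gaussian_kernel(v, sigma):
    """Isotropic gaussian kernel evaluated at offsets v of shape (..., M)."""
    m = v.shape[-1]
    norm = (2.0 * np.pi * sigma ** 2) ** (m / 2.0)
    return np.exp(-0.5 * np.sum(v ** 2, axis=-1) / sigma ** 2) / norm

def parzen_estimate(u, y, sigma):
    """Equation 66: average of kernels centred on the observations.

    u: (K, M) array of evaluation points; y: (N_y, M) array of observations.
    """
    diffs = y[None, :, :] - u[:, None, :]        # (K, N_y, M) holding y_i - u
    return gaussian_kernel(diffs, sigma).mean(axis=1)

# Illustrative one-dimensional data (M = 1); all values are arbitrary.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 0.3, size=(50, 1))
grid = np.linspace(-3.0, 3.0, 601)[:, None]

f_hat = parzen_estimate(grid, y, sigma=0.1)
delta_u = grid[1, 0] - grid[0, 0]
total_mass = f_hat.sum() * delta_u               # Riemann sum of the estimate

print(total_mass)
```

Because the kernel is smooth, the resulting estimate is differentiable with respect to
each observation, which is the property the method relies on.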

There are several properties of the Parzen density estimate of note. If the estimator
kernel function satisfies the properties above, the estimate will satisfy the properties
of a PDF. In the limit as $N_y \to \infty$, the estimator approaches the true underlying
distribution convolved with the kernel function, that is

$$\lim_{N_y \to \infty} \hat{f}_Y(u) = f_Y(u) * \kappa(u); \tag{67}$$

consequently, the Parzen window estimator is a biased estimator. The bias can be made
arbitrarily small by reducing the extent of the kernel at the cost of raising the variance
[Hardle, 1990].
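
The bias described by equation 67 is easy to observe when both the data and the kernel are
gaussian, since the convolution of two gaussians is again gaussian with the variances
added. The sketch below assumes scalar data with unit standard deviation and a
deliberately wide kernel; these values, and the function name `normal_pdf`, are
illustrative assumptions only.

```python
import numpy as np

def normal_pdf(x, std):
    """Zero-mean scalar gaussian density with standard deviation std."""
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
data_std, kernel_std = 1.0, 0.5
y = rng.normal(0.0, data_std, size=200_000)      # many observations
u = np.array([-1.0, 0.0, 1.0])                   # evaluation points

# Parzen estimate with a gaussian kernel (one-dimensional form of equation 66).
f_hat = normal_pdf(y[None, :] - u[:, None], kernel_std).mean(axis=1)

# True density, and the true density convolved with the kernel.
f_true = normal_pdf(u, data_std)
f_smoothed = normal_pdf(u, np.sqrt(data_std ** 2 + kernel_std ** 2))

print(f_hat, f_smoothed, f_true)
```

With many observations the estimate should track the smoothed density rather than the true
one; shrinking `kernel_std` reduces the gap at the cost of a noisier estimate.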

In the multidimensional case the form of the kernel is typically gaussian or hypercube. As
a result of the differentiability requirement of our method, the gaussian kernel is most
suitable here. The computational complexity of the estimator increases with dimension;
however, as we will be estimating the PDF in the output space of our mapping, the
dimensionality can be controlled.

5.4 Derivation Of The Learning Algorithm

Our goal is to find features that convey maximum information about the input class. How do
we adapt the parameters $\alpha$ of a mapping such that this is the case? We now show how
the Parzen window density estimator coupled with a property of entropy can be used to
accomplish this goal.

Consider the mapping $g: \Re^N \to \Re^M$; $M < N$, of a random vector $X \in \Re^N$,
which is described by the following equation

$$Y = g(X, \alpha). \tag{68}$$

If the mapping is nonlinear we can exploit the following property of entropy. If a random
variable has a finite region of support, entropy is maximized for the uniform
distribution. The Parzen window estimator, coupled with a mapping with a finite region of
support at the output (e.g. an MLP with sigmoidal nonlinearities), can be used to minimize
or maximize the "distance" between the observed distribution and the desired distribution.
Furthermore, if the region of support is a hypercube, as is the case for the MLP using
sigmoidal nonlinearities, the features are statistically independent when entropy is
maximized.

Considering equation 68, the method of Viola et al. estimates the value of the input
parameters, $X$, rather than the parameters of the mapping, $\alpha$. The goals are very
different.

By our choice of topology (MLP) and distance metric we are able to work with entropy
indirectly and fit the approach naturally into a back-propagation learning paradigm.

As our criterion we use the integrated squared error between our estimate of the output
distribution, $\hat{f}_Y(u, y)$, at a point $u$ over a set of observations $y$, and the
desired output distribution, $f_Y(u)$, which we approximate with a summation:

$$J = \frac{1}{2}\int_{\Omega_Y}\left(f_Y(u) - \hat{f}_Y(u, y)\right)^2 du
  \approx \frac{1}{2}\sum_j \left(f_Y(u_j) - \hat{f}_Y(u_j, y)\right)^2 \Delta u. \tag{69}$$


In equation 69, $\Omega_Y$ indicates the nonzero region (a hypercube for the uniform
distribution) over which the $M$-fold integration is evaluated. Assuming the output space
is sampled adequately, we can approximate this integral with a summation in which
$u_j \in \Re^M$ are samples in $M$-space and $\Delta u$ represents a volume.
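
The summation form of the criterion can be sketched in one dimension as follows. The
example assumes an output range of $[-1, 1]$ (so the desired uniform density is 0.5), a
gaussian kernel, and two hypothetical sets of output observations, one spread across the
range and one clumped near the origin. The function names and all numeric values are
assumptions made for illustration.

```python
import numpy as np

def parzen_1d(u, y, sigma):
    """One-dimensional Parzen estimate at points u with a gaussian kernel."""
    k = np.exp(-0.5 * ((y[None, :] - u[:, None]) / sigma) ** 2)
    return k.mean(axis=1) / (sigma * np.sqrt(2.0 * np.pi))

def ise_criterion(y, grid, sigma, desired=0.5):
    """Summation form of the integrated squared error on an even grid."""
    delta_u = grid[1] - grid[0]
    err = desired - parzen_1d(grid, y, sigma)    # distribution error at each u_j
    return 0.5 * np.sum(err ** 2) * delta_u

grid = np.linspace(-1.0, 1.0, 201)               # samples u_j covering Omega_Y
y_spread = np.linspace(-0.95, 0.95, 50)          # nearly uniform observations
y_clumped = 0.1 * y_spread                       # observations packed near zero

j_spread = ise_criterion(y_spread, grid, sigma=0.1)
j_clumped = ise_criterion(y_clumped, grid, sigma=0.1)

print(j_spread, j_clumped)
```

The criterion should be much smaller for the spread observations, which is the sense in
which minimizing it drives the output toward the uniform (maximum entropy) distribution.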

We  use  the  Parzen  window  method  [Parzen,  1962]  as  our  estimator  of  the  output  distri- 
bution. The  Parzen  window  estimate  of  a  PDF  is  written 


$$\hat{f}_Y(u, y) = \left(\frac{1}{N_y}\right)\sum_{i=1}^{N_y}\kappa(y_i - u), \tag{70}$$


where $\kappa(\ )$ is the kernel function, $y = \{y_1, \ldots, y_{N_y}\}$ is the set of
observations at the output of the mapping, and $u$ is the location at which the output
estimate is being computed. Since the output observations are functional mappings of the
input data, we can rewrite 70 as

$$\hat{f}_Y(u, y) = \hat{f}_Y(u, g(\alpha, x))
  = \left(\frac{1}{N_y}\right)\sum_{i=1}^{N_y}\kappa(g(\alpha, x_i) - u). \tag{71}$$
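The gradient of the criterion function with respect to the mapping parameters is
determined via the chain rule as

$$\begin{aligned}
\frac{\partial J}{\partial \alpha}
  &= \left(\frac{\partial J}{\partial \hat{f}}\right)\left(\frac{\partial \hat{f}}{\partial g}\right)\left(\frac{\partial g}{\partial \alpha}\right) \\
  &= -\Delta u \sum_j \left(f_Y(u_j) - \hat{f}_Y(u_j, y)\right)\left(\frac{\partial \hat{f}}{\partial g}\right)\left(\frac{\partial g}{\partial \alpha}\right) \\
  &= -\Delta u \sum_j \varepsilon_Y(u_j, y)\left(\frac{\partial \hat{f}}{\partial g}\right)\left(\frac{\partial g}{\partial \alpha}\right),
\end{aligned} \tag{72}$$

where $\varepsilon_Y(u_j, y)$ is the observed distribution error over all observations
$y$. The last term in 72, $\partial g/\partial \alpha$, is recognized as the sensitivity
of our mapping to the parameters $\alpha$. Since our mapping is a feed-forward MLP
($\alpha$ represents the weights and bias terms of the neural network), this term can be
computed efficiently using standard backpropagation. The remaining partial derivative,
$\partial \hat{f}/\partial g$, is

$$\frac{\partial \hat{f}}{\partial g}
  = \left(\frac{1}{N_y}\right)\sum_{i=1}^{N_y}\kappa'(g(\alpha, x_i) - u_j), \tag{73}$$

where $\kappa'(\ )$ is the derivative of the kernel function with respect to its argument.
Substituting 73 into 72 yields

$$\begin{aligned}
\frac{\partial J}{\partial \alpha}
  &= -\Delta u \sum_j \varepsilon_Y(u_j, y)\left(\frac{1}{N_y}\right)\sum_{i=1}^{N_y}\kappa'(y_i - u_j)\frac{\partial}{\partial \alpha}g(\alpha, x_i) \\
  &= -\left(\frac{1}{N_y}\right)\sum_{i=1}^{N_y}\frac{\partial}{\partial \alpha}g(\alpha, x_i)\sum_j \varepsilon_Y(u_j, y)\,\kappa'(y_i - u_j)\,\Delta u.
\end{aligned} \tag{74}$$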
The terms in 74, excluding the mapping sensitivities, become the new error direction term
in the backpropagation algorithm. It is important to distinguish error direction from
error. If the term were an error this would imply a desired output
($d = y + \varepsilon$), which is the case for a supervised training algorithm using the
mean square error criterion. However, in general, the partial derivative only implies the
direction in which we would like to perturb the observation. Later we will show how to
interpret the error direction as an actual error (resulting in a much simplified
algorithm).

By reversing the order of summations in the second step of 74 we see that the error
direction term associated with each observation is a convolution of the estimated error in
the observed output distribution, $\varepsilon_Y(u, y)$, with the gradient of the kernel,
$\kappa'(\ )$. It is through the gradient of the estimator kernel that the observed
distribution error influences the direction of each data observation in the output space
and thereby (through backpropagation) the parameters of the mapping. This point will be
further illustrated in the next section for the case of gaussian kernels.
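
The error direction term can be checked against a finite-difference gradient of the
summation criterion. The sketch below makes a simplifying assumption that is not part of
the method: the mapping is taken to be the identity on one-dimensional observations, so
the mapping sensitivity is one and the gradient with respect to each observation $y_i$ is
just the negated, scaled error direction. The kernel width, grid, data, and function names
are likewise illustrative.

```python
import numpy as np

SIGMA = 0.1          # gaussian kernel width (illustrative)
DESIRED = 0.5        # uniform desired density on [-1, 1]

def kernel(v):
    return np.exp(-0.5 * (v / SIGMA) ** 2) / (SIGMA * np.sqrt(2.0 * np.pi))

def kernel_grad(v):
    """Derivative of the gaussian kernel with respect to its argument."""
    return -kernel(v) * v / SIGMA ** 2

def criterion(y, grid):
    """Summation form of the integrated squared error criterion."""
    delta_u = grid[1] - grid[0]
    f_hat = kernel(y[None, :] - grid[:, None]).mean(axis=1)
    return 0.5 * np.sum((DESIRED - f_hat) ** 2) * delta_u

def criterion_grad(y, grid):
    """dJ/dy_i = -(1/N_y) * sum_j eps_j * kernel_grad(y_i - u_j) * delta_u."""
    delta_u = grid[1] - grid[0]
    diffs = y[None, :] - grid[:, None]               # (J, N_y) holding y_i - u_j
    eps = DESIRED - kernel(diffs).mean(axis=1)       # distribution error at u_j
    direction = (eps[:, None] * kernel_grad(diffs)).sum(axis=0) * delta_u
    return -direction / y.size

rng = np.random.default_rng(0)
y = rng.uniform(-0.8, 0.8, size=20)
grid = np.linspace(-1.0, 1.0, 401)

analytic = criterion_grad(y, grid)

# Central finite differences, one observation at a time.
step = 1e-6
numeric = np.empty_like(y)
for i in range(y.size):
    y_plus, y_minus = y.copy(), y.copy()
    y_plus[i] += step
    y_minus[i] -= step
    numeric[i] = (criterion(y_plus, grid) - criterion(y_minus, grid)) / (2 * step)

print(np.max(np.abs(analytic - numeric)))
```

With a real mapping, the per-observation quantity computed in `criterion_grad` would be
passed backward through the network in place of the usual supervised error term.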

The adaptation scheme is depicted in figure 45. As can be seen, this approach fits readily
into the backpropagation framework. The point set $x = \{x_1, \ldots, x_{N_y}\}$ is mapped
to a point set $y = \{y_1, \ldots, y_{N_y}\}$. The criterion then estimates from the set
an error between the observed output distribution and the baseline output distribution
(uniform in this case). From this distribution error computed over the range of the output
space, an error direction (whose sign depends on whether we wish to minimize or maximize
entropy) is associated with each data point in the set $y$. This error direction is then
backpropagated through the MLP in order to modify the parameters of the mapping.


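[Signal flow diagram: the inputs pass through the mapping network, $g(\alpha, [\ ])$, to
produce the outputs; the criterion block compares the estimate $\hat{f}_Y(u, y)$ with the
desired distribution $f_Y(u)$ and returns an error direction, which the BP algorithm uses
to adapt the mapping network.]

Figure 45. A signal flow diagram of the learning algorithm. The criterion block computes,
as a function of the observed outputs, the error direction for the mapping network.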


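It is interesting to compare this result to supervised training using error
backpropagation. When training in a supervised manner an explicit desired output $d_i$ is
assigned to each input $x_i$. MSE minimization results in the following adaptation of the
mapping parameters

$$\frac{\partial J}{\partial \alpha}
  = -\left(\frac{1}{N_y}\right)\sum_{i=1}^{N_y}\varepsilon_i\,\frac{\partial}{\partial \alpha}g(\alpha, x_i),
  \qquad \varepsilon_i = d_i - y_i, \tag{75}$$

where $y_i$ is the observed response to the input $x_i$ and $\varepsilon_i$ is the
observed output error. In contrast, maximizing or minimizing entropy in the manner
described results in the following adaptation of the mapping parameters

$$\frac{\partial J}{\partial \alpha}
  = \pm\left(\frac{1}{N_y}\right)\sum_{i=1}^{N_y}\left(\sum_j \varepsilon_Y(u_j, y)\,\kappa'(y_i - u_j)\,\Delta u\right)\frac{\partial}{\partial \alpha}g(\alpha, x_i), \tag{76}$$

which, neglecting the sign term, has the same form as equation 75 with one significant
difference: the output error $\varepsilon_i$ is replaced by the error direction. The sign
term depends on whether we are minimizing or maximizing entropy.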

5.5  Gaussian  Kernels 
Examination of the gaussian kernel and its differential in two dimensions illustrates some
of the practical issues of implementing this method of feature extraction and provides an
intuitive understanding of what is happening during the adaptation process.


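The $N$-dimensional gaussian kernel evaluated at some $u$ is (simplified for two
dimensions)

$$\kappa(u) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}u^T\Sigma^{-1}u\right)
  = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{u^Tu}{2\sigma^2}\right),
  \qquad \Sigma = \sigma^2 I,\ N = 2. \tag{77}$$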

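The partial derivative of the kernel (also simplified for the two-dimensional case) is

$$\frac{\partial \kappa}{\partial u} = -\kappa(u)\,\Sigma^{-1}u
  = -\frac{1}{2\pi\sigma^4}\exp\left(-\frac{u^Tu}{2\sigma^2}\right)u,
  \qquad \Sigma = \sigma^2 I,\ N = 2. \tag{78}$$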

These functions are shown in figure 46 for the two-dimensional case. Recall that the term

$$\sum_j \varepsilon_Y(u_j, y)\,\kappa'(y_i - u_j)\,\Delta u$$

in 74 replaces the standard supervised error direction term in the backpropagation
algorithm. From the figure we see that when we are maximizing entropy, the distribution
error through the kernel functions acts as a local attractor when the computed PDF error
is positive and as a local repellor when the PDF error is negative. When we are minimizing
entropy the behavior is opposite. In this way the adaptation procedure operates in the
feature space locally from a globally derived measure of the output space (PDF estimate).

Figure 46. Gradient of two-dimensional gaussian kernel. The kernels act as attractors to
low points in the observed PDF on the data when entropy maximization is desired.

The one-dimensional example of figure 47 further illustrates the entropy
minimizing/maximizing behavior of the algorithm. The figure shows a bi-modal distribution
(that has presumably been estimated from observations) overlain on a desired distribution
that is uniform from -1 to 1. Also shown in the figure is the gradient of the estimator
kernel (the kernel is a gaussian). The kernel gradient is convolved with the difference
between the desired and observed distributions to determine the error direction. The
resulting error direction is shown in figure 48. The difference between the cases is the
sign of the error direction. As we can see, and would expect, when we are maximizing
entropy (top figure) the error direction points away from the modes of the observed
distribution, while when we are minimizing entropy (bottom figure) the error direction
points to the center of the modes. This repulsion/attraction behavior extends to the
multi-dimensional case as well. The bottom of the figure illustrates another point with
regard to feature extraction. As we can see, when we are minimizing entropy, the trend is
to make the observations more compact, a property which would be useful for identifying a
class.
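
The entropy-maximizing case can be imitated with a few lines of code. The sketch below is
an illustration under simplifying assumptions, not the training procedure itself: the
observations are moved directly (an identity mapping stands in for the MLP), the data are
a hypothetical bi-modal sample, and the step size and iteration count are arbitrary. Each
step moves every observation along its error direction, which is a gradient descent step
on the integrated squared error criterion.

```python
import numpy as np

SIGMA = 0.1                                   # gaussian kernel width
GRID = np.linspace(-1.0, 1.0, 401)            # samples u_j on [-1, 1]
DELTA_U = GRID[1] - GRID[0]
DESIRED = 0.5                                 # uniform density on [-1, 1]

def kernel(v):
    return np.exp(-0.5 * (v / SIGMA) ** 2) / (SIGMA * np.sqrt(2.0 * np.pi))

def distribution_error(y):
    """Desired minus observed (Parzen) density at every grid point."""
    return DESIRED - kernel(y[None, :] - GRID[:, None]).mean(axis=1)

def criterion(y):
    return 0.5 * np.sum(distribution_error(y) ** 2) * DELTA_U

def error_direction(y):
    """sum_j eps_j * kappa'(y_i - u_j) * delta_u for each observation y_i."""
    diffs = y[None, :] - GRID[:, None]
    kernel_grad = -kernel(diffs) * diffs / SIGMA ** 2
    return (distribution_error(y)[:, None] * kernel_grad).sum(axis=0) * DELTA_U

# Hypothetical bi-modal observations in the output space.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-0.4, 0.1, 25), rng.normal(0.4, 0.1, 25)])
j_start = criterion(y)

step_size = 0.05
for _ in range(100):
    # dJ/dy_i equals -(1/N_y) times the error direction, so descending the
    # criterion means adding the scaled error direction to each observation.
    y = y + step_size * error_direction(y) / y.size
j_end = criterion(y)

print(j_start, j_end)
```

The criterion should decrease as the observations are pushed away from the two modes.
Reversing the sign of the update corresponds to the entropy-minimizing case discussed
above.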




Figure 47. Mixture of gaussians example. The estimated distribution is a mixture of
gaussians, while the desired distribution is uniform between -1 and 1. The kernel
gradient, which will be convolved with the difference between the two distributions, is
shown as a dotted line.

5.6 Maximum Entropy/PCA: An Empirical Comparison
We present some experimental results designed to illustrate the properties of information
theoretic feature extraction as compared to a signal representational approach. In these
experiments we will compare a simple entropy-maximizing feature extractor to the well
known principal components analysis (PCA) approach to feature extraction. The source
distributions are simple by design, but as we shall see they are sufficient to show the
differences in the two methods.

We  will  begin  with  the  simple  case  of  a  two  dimensional  gaussian  distribution.  The 
distribution  we  will  use  is  zero  mean  with  a  covariance  matrix  of 


$$\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 0.1 \end{bmatrix}.$$


[Two plots over the output range, approximately -0.6 to 0.6: entropy max (top) and entropy
min (bottom).]

Figure 48. Mixture of gaussians example, entropy minimization and maximization. The plots
above show the resulting influence function when the kernel gradient is convolved with the
observed distribution error. The sign depends on whether we are minimizing (bottom) or
maximizing (top) entropy.

The contours of this distribution are shown in figure 49 along with the image of the first
principal component features. We see from the figure that the first principal component
lies along the $x_1$-axis. We draw a set of observations (50 in this case) from this
distribution and compute a mapping using an MLP and the entropy maximizing criterion
described in previous sections. The architecture of the MLP is 2-4-1, indicating 2 input
nodes, 4 hidden nodes, and 1 output node. The nonlinearity used is the hyperbolic tangent
function. We are, therefore, nonlinearly mapping the two-dimensional input space onto a
one-dimensional output space. The right-hand plot of figure 49 shows the image of the
maximum entropy mapping onto the input space. From the contours of this mapping we see
that the maximum entropy mapping lies essentially in the same direction as the first
principal component.


[Two contour plots: PCA features (left) and entropy mapping (right).]

Figure 49. PCA vs. Entropy - gaussian case. Left: image of PCA features shown as contours.
Right: Entropy mapping shown as contours.


This  result  is  expected.  It  illustrates  that  when  the  gaussian  assumption  is  supported  by 
the  data,  maximum  entropy  and  PCA  are  equivalent  from  the  standpoint  of  direction.  This 
result  has  been  recognized  by  many  researchers.  In  fact  the  gaussian  assumption  is  often 
used as a limiting case for maximum entropy approaches [Plumbley and Fallside, 1988].
These  techniques,  however,  only  examine  the  covariance  of  the  data  in  the  output  space. 


We are more interested in the result when the gaussian assumption is not correct. In
this case we would not expect the PCA and entropy mappings to be equivalent. We conduct
a second experiment to illustrate this point, in which we draw observations from a random
source whose underlying distribution is not gaussian. Specifically, the PDF is a
mixture of gaussian modes with the following form

    $p(x) = \frac{1}{2}\left( N(x, m_1, \Sigma_1) + N(x, m_2, \Sigma_2) \right)$

where $N(x, m, \Sigma)$ is a gaussian distribution with mean $m$ and covariance $\Sigma$. In this
case

    $m_1 = \begin{bmatrix} -0.9 \\ 0.0 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 0.05 & 0 \\ 0 & 1.2 \end{bmatrix}, \quad m_2 = \begin{bmatrix} 0.9 \\ 0.0 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.05 & 0 \\ 0 & 0.8 \end{bmatrix}$

It can be shown that the principal components of this distribution are the eigenvectors
of the matrix

    $R = \frac{1}{2}\left( \Sigma_1 + m_1 m_1^T + \Sigma_2 + m_2 m_2^T \right) = \begin{bmatrix} 0.86 & 0 \\ 0 & 1 \end{bmatrix}$

with the principal component vector parallel to the $x_2$-axis.
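
The value of $R$ quoted above can be checked numerically. The following is a minimal sketch, assuming equal mode weights and the parameter values of the second experiment; the helper names `mixture_second_moment` and `principal_angle_deg` are illustrative and do not come from the original work.

```python
import math

def mixture_second_moment(means, covs):
    """Second-moment matrix R of an equally weighted 2-D gaussian mixture.

    R = (1/K) * sum_k (Sigma_k + m_k m_k^T), returned as a nested list.
    """
    k = len(means)
    R = [[0.0, 0.0], [0.0, 0.0]]
    for m, S in zip(means, covs):
        for r in range(2):
            for c in range(2):
                R[r][c] += (S[r][c] + m[r] * m[c]) / k
    return R

def principal_angle_deg(R):
    """Angle, in degrees from the x1-axis, of the leading eigenvector
    of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = R[0][0], R[0][1], R[1][1]
    return math.degrees(0.5 * math.atan2(2.0 * b, a - d))

# Parameter values of the second experiment (two gaussian modes).
means = [(-0.9, 0.0), (0.9, 0.0)]
covs = [[[0.05, 0.0], [0.0, 1.2]], [[0.05, 0.0], [0.0, 0.8]]]

R = mixture_second_moment(means, covs)
angle = principal_angle_deg(R)
print(R)      # approximately [[0.86, 0.0], [0.0, 1.0]]
print(angle)  # 90 degrees in magnitude: leading eigenvector along the x2-axis
```

Because the larger eigenvalue (1 versus 0.86) belongs to the second coordinate, the PCA projection ignores the separation of the two modes along the first coordinate.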

This distribution is shown in figure 50 along with its first principal component feature
mapping. The right side of figure 50 shows the image of the maximum entropy mapping.
As we can see, there are two distinct differences between this mapping and the PCA result.
The first observation is that the mapping is nonlinear. The second observation is that the
maximum entropy mapping is more tuned to the structure of the data in the input space. It
is interesting to note that the maximum entropy mapping weights the tails of the modes
equally, as evidenced by the greater spreading of the contours for the mode with the larger
eigenvalue, while the PCA mapping does not. We can say from observing the results that
the maximum entropy mapping is superior in describing the underlying structure of the
data when compared to the PCA mapping.


[Figure 50 panel titles: PCA mapping (left), entropy mapping (right)]


Figure  50.  PCA  vs.  Entropy  -  non-gaussian  case.  Left:  image  of  PCA  features 
shown  as  contours.  Right:  Entropy  mapping  shown  as  contours. 


We consider one more bi-modal distribution. The setup is the same as the previous
case (a bi-modal distribution) with the modifications

    $\Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 0.1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.1 & 0 \\ 0 & 1 \end{bmatrix}$

Consequently, the principal components are the eigenvectors of the matrix

    $R = \begin{bmatrix} 0.62 & 1 \\ 1 & 0.62 \end{bmatrix}$

with the major principal component at 45 degrees above the $x_1$-axis. As in the previous
case we compare the principal component mapping to the maximum entropy mapping.
The results are shown in figure 51. Again, as in the previous case, it is evident that the
maximum entropy mapping is better related to the underlying structure of the distribution,
as it has found the separate directions of the individual modes, whereas the PCA projection
has essentially averaged the directions.


Figure  51.  PCA  vs.  Entropy  -  non-gaussian  case.  Left:  image  of  PCA  features 
shown  as  contours.  Right:  Entropy  mapping  shown  as  contours. 

These experiments help to illustrate the differences between PCA (a signal representation
method) and entropy (an information-theoretic method). PCA is primarily concerned
with direction finding and only considers the second order statistics of the underlying data,
while entropy explores the structure of the data class. In a few limited cases, second order
statistics are sufficient (e.g. gaussian) to describe such structure, but in general they are
not.


5.7 Maximum Entropy: ISAR Experiment

We now present some experimental results using a maximum entropy feature extractor
for ISAR data. The mapping structure we use in our experiment is a multi-layer perceptron
with a single hidden layer (4096 input nodes, 4 hidden nodes, 2 output nodes). The method
is used to extract features from two vehicle types with ISAR images from 180 degrees of
aspect. Examples of the imagery are shown in figure 52. In these experiments the
optimization criterion is always to maximize entropy conditioned on the input class. The input
class may be represented by a single vehicle type or both vehicle types depending on the
experiment. This is not how the technique would be applied to the NL-MACE structure
(recall that mutual information has both an entropy minimizing and maximizing term), but
the results are interesting and further illustrate the potential of the information theoretic
approach.


Figure  52.  Example  ISAR  images  from  two  vehicles  used  for  experiments.  The 

vehicles  were  rotated  through  an  aspect  range  of  0  to  180  degrees.  The 
top  and  bottom  rows  are  from  different  vehicles. 



5.7.1 Maximum Entropy: Single Vehicle Class

In the first experiment we trained the feature extractor on a single vehicle (upper
vehicle in figure 52) over 180 degrees of aspect with 3.6 degrees of aspect between each
exemplar. We show the mapping of the input images onto the two dimensional output feature
space in figures 53, 54, and 55 after 100, 200 and 300 iterations, respectively. The
mappings of the images into the feature space are connected in order of increasing aspect. In the
latter two plots it is clear that the extracted features have begun to fill the output feature
space, but have also maintained aspect dependence on the images. We believe that this is
evidence that while the method increases the statistical independence of the two output
features, it is still tuned to the underlying distortion of the input vehicle class as
represented by rotation through aspect.


[Figure 53 panels: feature space (training exemplars), feature space (testing exemplars)]


Figure  53.  Single  vehicle  experiment,  100  iterations.  Projection   of  training   (top 
left)  and  testing  (top  right)  images  onto  feature  space. 


We believe that this is evidence that the mapping has maintained topological
neighborhoods in a similar fashion to the Kohonen self-organizing feature map (SOFM) [Kohonen,
1995]. The difference between this approach and the SOFM approach is that in this case
the mapping is continuous, whereas in the SOFM the samples are mapped onto a discrete
lattice. The relationship of this maximum entropy mapping approach to the SOFM of
Kohonen is a topic that will be left for later research.

[Figure 54 panels: feature space (training exemplars), feature space (testing exemplars)]

Figure 54. Single vehicle experiment, 200 iterations. Projection of training (top
left) and testing (top right) images onto feature space. Adjacent aspect
angles are connected by a line.

[Figure 55 panels: feature space (training exemplars), feature space (testing exemplars)]

Figure 55. Single vehicle experiment, 300 iterations. Projection of training (top
left) and testing (top right) images onto feature space.

5.7.2 Maximum Entropy: Two Vehicle Classes
It is commonly assumed in the blind source separation problem that the sources are
statistically independent [Bell and Sejnowski, 1995]. Maximum entropy has been used in
approaches to this problem. As an example of blind source separation we repeat the
previous experiment on both vehicles from figure 52, which can be modeled as statistically
independent sources. The projection of the training images (and between-aspect testing
images) is shown in figure 56 (where adjacent aspect training images are connected). As
can be seen, significant class separation is exhibited (without a priori labeling of the classes).
In the early stages of learning the method appears to maximize information with regard to
the underlying distortion common to both classes (rotation through aspect). As the
mapping is refined the information begins to focus on the differences between the classes.

[Figure 56 panels: feature space (training exemplars), feature space (testing exemplars)]

Figure 56. Two vehicle experiment. Projection of training (left) and testing (right)
images onto feature space after 150 (top) and 300 (bottom) iterations
for two vehicle class training. Vehicle 1 is indicated by diamond
symbols, while vehicle 2 is indicated by triangles. Each class is
connected in order of aspect angle. It appears in these figures that the
mapping has maintained aspect dependence for each vehicle. At the 300
iteration point some separation of the vehicles is in evidence. In the
bottom left plot, the connecting lines have been removed in order to
better show the class separation which has taken place.

5.8 Computational Simplification of the Algorithm
So far we have only presented results using the method to maximize entropy. Our
interest with regards to classification is mutual information, specifically as described by
equation 57, where the mutual information is a function of the observed output entropies.
However, before we discuss extensions to mutual information we present some significant
computational aspects of our method. Other techniques for entropy manipulation of
continuous random variables have been proposed. However, these methods either
oversimplify (assume Gaussianity or unimodal pdfs [Bell and Sejnowski, 1995, Plumbey and
Fallside, 1988]) or are overly complex (Edgeworth expansions [Wong and Blake, 1994]).
In contrast, the method here is fairly straightforward and, as we will show, computationally
simple. The results of this section greatly reduce the computational complexity of our
approach and yield a surprisingly simple and intuitive perspective of mutual information.

Again we consider equation 74, where we have already observed that the implicit error
direction is the convolution of the observed distribution error with the kernel gradient. We
illustrate this by rewriting the implicit error direction term associated with each
observation $y_i$ (excluding the term related to mapping sensitivities and neglecting the sign for the
moment) as

    $\varepsilon_i = \sum_j \varepsilon_Y(u_j \mid \{y\}) \, \kappa'(y_i - u_j) \, \Delta u$    (79)

where $\varepsilon_Y(u \mid \{y\})$ indicates the observed distribution error at point $u$, estimated over the
set of observations $\{y\}$.
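
Equation 79 can be evaluated directly by sampling the output space on a grid. The following is a minimal one-dimensional sketch, assuming a gaussian kernel, a desired density that is uniform on $[-a/2, a/2]$, and a distribution error taken as desired minus observed. The function names, grid resolution and padding are illustrative choices and are not taken from the original work.

```python
import math

def gauss(u, sigma):
    """One-dimensional gaussian kernel."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def gauss_grad(u, sigma):
    """Derivative of the gaussian kernel with respect to u."""
    return -u / sigma ** 2 * gauss(u, sigma)

def parzen(u, ys, sigma):
    """Parzen window estimate of the observed density at location u."""
    return sum(gauss(u - y, sigma) for y in ys) / len(ys)

def implicit_error_grid(ys, sigma, a=1.0, n_grid=400, pad=5.0):
    """Equation 79 by direct summation over a uniform grid (1-D case)."""
    lo = min(-a / 2.0, min(ys)) - pad * sigma
    hi = max(a / 2.0, max(ys)) + pad * sigma
    du = (hi - lo) / n_grid
    grid = [lo + (j + 0.5) * du for j in range(n_grid)]
    # Observed distribution error: desired (uniform) minus Parzen estimate.
    err = [(1.0 / a if abs(u) <= a / 2.0 else 0.0) - parzen(u, ys, sigma)
           for u in grid]
    # For each observation y_i: sum_j err(u_j) * kernel'(y_i - u_j) * du.
    return [sum(e * gauss_grad(y - u, sigma) for e, u in zip(err, grid)) * du
            for y in ys]

ys = [-0.1, 0.1]
eps = implicit_error_grid(ys, sigma=0.05)
print(eps)  # equal magnitude, opposite sign, by symmetry of the data
```

Every observation requires a pass over the whole grid, and the grid itself requires a pass over every observation, which is the cost the remainder of this section sets out to avoid.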

At first glance it would seem that the method is computationally expensive.
Computation of the Parzen window estimate is itself of order $N_y N_u$, the number of observations
multiplied by the number of locations in the output space at which the estimator is
computed. Reasonable estimates of the density using a discrete approximation require $N_u$ to
increase exponentially with the dimension of the output space. Furthermore, from
equation 79, in order to compute the implicit error term we multiply the complexity of the
computation by $N_u$, yet again, to yield an overall complexity of


WX'*'.  (80) 


where $N_Y$ is the dimension of the output space and $N_u$ is with respect to a one
dimensional output space. Furthermore, in the equation we set $N_u = 3N_y$ in order to get an
accurate estimate of the implicit error term; that is, the sampling grid is on the order of three times
as dense as the data observations (assuming gaussian kernels [Hardle, 1990]). Using this
rule of thumb, the order of the computational complexity as a function of the output
dimension and the number of observations becomes

^'^'V;"'  (81) 

Fortunately, the dimensionality of the output space is controlled by the designer;
however, equation 81 poses a fundamental computational limitation to the dimensionality of
the subspace mapping. This limitation, however, is only valid if the implicit error term is
computed in the straightforward manner that the equations imply. Further examination of
the Parzen window estimator shows how this complexity can be greatly reduced. The
final result reveals that the implicit error term can be computed purely as a function of the
local interaction between the observations in the output space.

The Parzen window estimator is the convolution of the kernel with the data, therefore
we can rewrite equation 79 as

    $\varepsilon_i = \varepsilon_Y(u \mid \{y\}) * \kappa'(u) \big|_{u = y_i}$
    $\quad = \left( f_Y(u) - \hat{f}_Y(u \mid \{y\}) \right) * \kappa'(u) \big|_{u = y_i}$    (82)
    $\quad = \left( f_Y(u) - y(u) * \kappa(u) \right) * \kappa'(u) \big|_{u = y_i}$

where the term

    $y(u) = \frac{1}{N_y} \sum_{j=1}^{N_y} \delta(u - y_j)$

represents the data set as observed in the output space, $f_Y(u)$ is the desired output
distribution (uniform), and $\hat{f}_Y(u \mid \{y\})$ is the observed output distribution estimated over the set
$\{y\}$. Continuing from the last step of 82,

    $\varepsilon_i = \left( f_Y(u) * \kappa'(u) \right) - y(u) * \kappa(u) * \kappa'(u) \big|_{u = y_i}$
    $\quad = f_r(u) - y(u) * \kappa_a(u) \big|_{u = y_i}$    (83)
    $\quad = f_r(y_i) - \frac{1}{N_y} \sum_j \kappa_a(y_i - y_j)$

The terms $\kappa_a(u)$ and $f_r(u)$ will be termed the attractor kernel and the topology
regulating term, respectively, for reasons which will become clear. The significance of
equation 83 is that it breaks the fundamental limitation imposed by equation 81. Both the
attractor kernel and topology regulating terms can be computed analytically. The implicit
error term is therefore the negative of the convolution of the attractor kernel with the data
(as it is projected into the output space). More importantly the computational complexity
of equation 82 is only of order $N_y$ for each $y_i$. The total computational complexity is
therefore significantly reduced from that of equation 78.


In section A.3 of the appendix, the analytic form of $\kappa_a$ is derived for the $N$-dimensional
gaussian kernel with covariance matrix of the form

    $\Sigma = \sigma^2 I$.

The result from equation 112 in the appendix is

    $\kappa_a(u) = \kappa(u) * \kappa'(u) = -\frac{1}{2^{N+1} \pi^{N/2} \sigma^{N+2}} \exp\left( -\frac{u^T u}{4\sigma^2} \right) u$    (84)

where $N$ is the dimension of the kernel.
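
The closed form of the attractor kernel can be sanity-checked against a direct numerical convolution. The sketch below assumes the isotropic gaussian kernel described above and relies on the standard identity that a gaussian convolved with the gradient of an identical gaussian is the gradient of a gaussian with twice the variance. The function names and the test values are illustrative.

```python
import math

def attractor_kernel(u, sigma):
    """Attractor kernel for an N-dimensional isotropic gaussian kernel.

    u is a sequence of length N; the result is the list
    -u * exp(-|u|^2 / (4 sigma^2)) / (2^(N+1) pi^(N/2) sigma^(N+2)).
    """
    n = len(u)
    r2 = sum(x * x for x in u)
    scale = math.exp(-r2 / (4.0 * sigma ** 2)) / (
        2.0 ** (n + 1) * math.pi ** (n / 2.0) * sigma ** (n + 2))
    return [-scale * x for x in u]

def attraction_term(i, ys, sigma):
    """(1/N_y) * sum_j attractor_kernel(y_i - y_j): one pass over the data."""
    total = [0.0] * len(ys[i])
    for yj in ys:
        k = attractor_kernel([a - b for a, b in zip(ys[i], yj)], sigma)
        total = [t + v / len(ys) for t, v in zip(total, k)]
    return total

def numeric_conv_1d(u, sigma, n=4000, span=10.0):
    """Midpoint-rule evaluation of (K * K')(u) in one dimension."""
    def K(v):
        return math.exp(-0.5 * (v / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)
    def dK(v):
        return -v / sigma ** 2 * K(v)
    lo = -span * sigma
    dv = 2.0 * span * sigma / n
    total = 0.0
    for j in range(n):
        v = lo + (j + 0.5) * dv
        total += K(v) * dK(u - v)
    return total * dv

sigma = 0.3
analytic = attractor_kernel([0.2], sigma)[0]
numeric = numeric_conv_1d(0.2, sigma)
print(analytic, numeric)  # the two values should agree closely
```

Evaluating `attraction_term` for every observation touches each pair of points once, so no sampling grid over the output space is needed.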

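In section A.4 of the appendix, the analytic form of $f_r$ is also derived. The result from
equation 120 is

    $f_r(u) = \frac{1}{a^N} \begin{bmatrix}
        \left( \prod_{i \ne 1} \frac{1}{2} \left[ \operatorname{erf}\left( \frac{u_i + a/2}{\sqrt{2}\sigma} \right) - \operatorname{erf}\left( \frac{u_i - a/2}{\sqrt{2}\sigma} \right) \right] \right) \left( \kappa_1(u_1 + \tfrac{1}{2}a, \sigma) - \kappa_1(u_1 - \tfrac{1}{2}a, \sigma) \right) \\
        \vdots \\
        \left( \prod_{i \ne N} \frac{1}{2} \left[ \operatorname{erf}\left( \frac{u_i + a/2}{\sqrt{2}\sigma} \right) - \operatorname{erf}\left( \frac{u_i - a/2}{\sqrt{2}\sigma} \right) \right] \right) \left( \kappa_1(u_N + \tfrac{1}{2}a, \sigma) - \kappa_1(u_N - \tfrac{1}{2}a, \sigma) \right)
    \end{bmatrix}$    (85)

where the desired distribution is uniform in the hypercube centered at zero with sides of
length $a$, $N$ is the dimension of the kernel (and the output space), and $\kappa_1(m, \sigma)$ is the one
dimensional gaussian kernel with mean $m$ and standard deviation $\sigma$. These functions are
shown for the two dimensional case in figures 57 and 58.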




Figure 57. Two dimensional attractor functions. The $x_1$-component is shown at
top while the $x_2$-component is shown at bottom. The function
represents the influence of each data point on its locale in the output
space. As in the analysis of the kernel gradients, we see that this
function

The magnitude of the regulating function is shown in figure 59. It is evident from the
figure that the regulating function only has influence at the boundaries of the desired
output distribution. Furthermore, examination of equation 85 shows that the topology
regulating function contains an erf( ) function evaluation when the output space is greater
than one dimension. From a computational standpoint this function evaluation can be
costly. This term is essentially unity except at the vertices of the hypercube. Figure 60
shows an approximation of equation 85 minus the erf( ) evaluation. As can be seen in
the figure, within the region of support of the desired distribution, the function is
essentially unchanged. If we match the region of support of the desired output distribution to the
mapping topology, the approximation can be used in order to save significant
computation.

Figure 58. Two dimensional regulating function. The $x_1$-component is shown at
top while the $x_2$-component is shown at the bottom.
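
The regulating term and its erf-free approximation can be sketched as follows. This is an illustrative implementation, assuming a desired density that is uniform on a hypercube of side $a$ centered at zero, with the $1/a^N$ normalization of that density written out explicitly. The form is obtained by convolving the uniform density with the gradient of an isotropic gaussian kernel; the function names and the `use_erf` flag are hypothetical.

```python
import math

def gauss1(u, sigma):
    """One-dimensional gaussian kernel."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def box_mass(u, a, sigma):
    """Integral of a 1-D gaussian centered at u over the interval [-a/2, a/2]."""
    s = math.sqrt(2.0) * sigma
    return 0.5 * (math.erf((u + a / 2.0) / s) - math.erf((u - a / 2.0) / s))

def regulating_term(u, a, sigma, use_erf=True):
    """Topology regulating term for a uniform density on a hypercube.

    Component k is the edge term gauss1(u_k + a/2) - gauss1(u_k - a/2),
    multiplied by the erf factors of all other coordinates when use_erf
    is True, or by 1 when the erf evaluation is skipped.
    """
    n = len(u)
    out = []
    for k in range(n):
        edge = gauss1(u[k] + a / 2.0, sigma) - gauss1(u[k] - a / 2.0, sigma)
        factor = 1.0
        if use_erf:
            for i in range(n):
                if i != k:
                    factor *= box_mass(u[i], a, sigma)
        out.append(factor * edge / a ** n)
    return out

a, sigma = 1.0, 0.1
edge_mid = (-0.5, 0.0)   # middle of one edge of the square
corner = (-0.5, -0.5)    # a vertex of the square
exact_edge = regulating_term(edge_mid, a, sigma)
approx_edge = regulating_term(edge_mid, a, sigma, use_erf=False)
exact_corner = regulating_term(corner, a, sigma)
approx_corner = regulating_term(corner, a, sigma, use_erf=False)
print(exact_edge, approx_edge)      # nearly identical away from the vertices
print(exact_corner, approx_corner)  # differ by about a factor of two at a vertex
```

With these values the two versions agree along the edge and differ only near the vertex, which is the behavior the approximation relies on.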


Figure  59.  Magnitude  of  the  regulating  function.  The  magnitude  of  this  function  is 
zero  except  near  the  boundary  of  the  desired  output  distribution. 



Figure 60. Approximation of the regulating function. The figure shows the
regulating function when the erf( ) is ignored. The change is not
significant within the region of support of the desired distribution.

The result of this analysis is that the manipulation of entropy can be modeled as local
interactions of the observed data in the output space. The function of the attractor kernel,
$\kappa_a(\cdot)$, is to model the interaction of the data points with each other, while the function
of the topology regulating term, $f_r(\cdot)$, is to model the interaction between the data
points and the constraints of the desired output distribution. Furthermore, if the mapping
topology is matched to the desired output distribution, the evaluation of the topology
regulating term can be further simplified. The final algorithm complexity has been reduced
substantially as


0(ivJ'''*')-^0(ivJ) 


(86) 


5.9 Conversion of Implicit Error Direction to an Explicit Error
In the previous section we derived a method which greatly simplified the complexity
of the error direction computation. In the process, manipulation of a global property,
entropy, was seen to be a process of local attraction/repulsion of the individual
observations in the output space. This idea of maximizing and minimizing entropy and ultimately
mutual information through local interactions can be further extended such that the
computed error direction can be converted into an implicit desired signal. That is, we can go
from an unsupervised learning algorithm to one which is supervised, in a step-wise
fashion. The resulting simplification to the algorithm is that we no longer need to estimate the
error direction for every gradient step.

5.9.1 Entropy Minimization as Attraction to a Point

We  begin  with  entropy  minimization  which  is  modeled  as  local  attraction  between  the 
data  points.  In  figure  48,  the  bottom  plot  indicates  that  the  points  are  attracted  to  the  center 
of  the  observed  distribution  modes,  with  the  degree  of  attraction  being  stronger  for  the 
larger  mode.  As  we  have  stated,  however,  the  influence  function  is  in  reality  a  direction.  If 
a  proper  scale  factor  can  be  found  then  the  error  direction  can  be  equated  to  an  actual  error 
and  a  desired  signal. 


The extent of the attraction field between points is directly proportional to the kernel
size as represented by $\sigma$ in equation 84. In the equation we also see that the degree of
attraction is inversely proportional to $\sigma^{N+2}$, where $N$ is the dimension of the kernel. So
as the kernel size decreases the degree of attraction increases dramatically.

Again, referring to figure 48, attraction to a point makes sense from an intuitive
standpoint with regards to minimizing entropy. We also recognize that the influence of all of the
points is additive. So in order to ensure that the net attraction is to a point, we simply set
the gradient at the center of the attractor kernel to unity. The scale factor as a function of the
kernel size and dimension is solved for in section A.3 of the appendix with the result


    $\left. \frac{\partial \kappa_a(u)}{\partial u} \right|_{u = 0} = -1$    (87)


Figure 61 illustrates three cases of scaling the attractor kernel for one dimension. We
can see in the figure that when the attractor is scaled such that the slope is less than or equal to
unity we will get stable attraction to a point. As a result, when minimizing entropy we are
able to compute an explicit desired output as a function of the current configuration of the
observations in the output space. This allows us to train a multi-layer perceptron in a
supervised fashion. When the MSE of the error is reduced satisfactorily we can compute a
new desired signal based on the new configuration of the observations in the output space.

One question which remains is how to set the size of the kernel. Towards that goal we
note that figure 61 has been normalized by the kernel size, $\sigma$, and by virtue of our scale
factor this plot can be extended to the multidimensional case as well. The field of
influence is essentially zero when the distance from the center of the kernel is greater than $3\sigma$,

    $|y - y_i| > 3\sigma$.    (88)

[Figure 61 panels: undershoot, slope normalized, overshoot]

Figure 61. Feedback functions for implicit error term. Undershoot condition (top),
slope normalized (middle), overshoot (bottom).

The process relies on local interaction, and so from an attraction viewpoint we can use
the two most distant nearest neighbors to set the kernel size (and adapt it during the
learning process). Stated mathematically,

    $\sigma = \max_i \left( \min_{j \ne i} \lVert y_i - y_j \rVert \right)$    (89)


From  a  practical  standpoint,  the  mutual  distances  between  each  point  must  be  computed  in 
the  course  of  evaluating  the  kernels,  and  so  equation  89  does  not  represent  a  significant 
additional  burden. 

Figure 62 shows an example of using local attraction to minimize entropy. In the
figures there are two clusters of points. By choosing the kernel according to the maximum
nearest neighbor distance, the points within the local clusters are attracted to a point. We
also see that the clusters are converging to a single point in the third iteration.
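
A minimal sketch of entropy minimization as local attraction follows. It assumes that the attractor kernel is scaled to unit slope at its center, that the kernel size is set equal to the maximum nearest-neighbor distance, and that each point is displaced by the average of the scaled attractions from all points. The exact update and constants used in the original experiment are not given in the text, so these choices, the sample points, and all function names are illustrative.

```python
import math

def max_nn_distance(pts):
    """Largest nearest-neighbor distance over all points (pairwise scan)."""
    return max(min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
               for i, p in enumerate(pts))

def attraction_step(pts, sigma):
    """Move every point by the average unit-slope attraction of all points."""
    n = len(pts)
    new_pts = []
    for p in pts:
        move = [0.0] * len(p)
        for q in pts:
            diff = [pc - qc for pc, qc in zip(p, q)]
            w = math.exp(-sum(d * d for d in diff) / (4.0 * sigma ** 2))
            # Scaled attractor kernel: -diff * w, slope of magnitude one at zero.
            move = [m - w * d / n for m, d in zip(move, diff)]
        new_pts.append(tuple(pc + m for pc, m in zip(p, move)))
    return new_pts

def diameter(pts):
    """Largest distance between any two points."""
    return max(math.dist(p, q) for p in pts for q in pts)

# Two small, well separated clusters of points.
pts = [(-0.5, 0.0), (-0.45, 0.05), (-0.55, 0.04), (-0.5, 0.1),
       (0.5, 0.2), (0.45, 0.25), (0.55, 0.22), (0.5, 0.3)]
start = diameter(pts)
for _ in range(3):
    pts = attraction_step(pts, max_nn_distance(pts))
print(start, diameter(pts))  # the point set contracts
```

Because each updated point is a weighted average of the current points, the step cannot overshoot, which corresponds to the stable, slope-limited case discussed above.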


[Figure 62 panels: iteration 1, iteration 2, iteration 3]

Figure 62. Entropy minimization as local attraction. The figures above show three
iterations of the local attraction algorithm. The two groups of points are
seen to be attracted to their local means as well as to each other.

5.9.2 Entropy Maximization as Diffusion

If instead the goal is maximum entropy, then the local interaction becomes
repulsion and the feedback terms of figure 61 point in the opposite direction. We can use the
idea of uniform diffusion in the output space in order to set the kernel size for entropy
maximization. In the early stages of learning we would like the relative kernel size to be
large. In this way, dense groupings of points will maximally interact (and repel); however,
towards the later stages of learning we would like the interaction to decrease to a
negligible level as the distribution approaches uniformity.

If we make the assumption that the observed distribution (which we no longer
compute in the local interaction framework) will eventually approach uniformity, we have a
basis for setting an upper bound on the size of the kernel. Given $N_y$ points in an
$N_Y$-dimensional space, uniformly spaced in a hypercube, the distance between nearest points
approaches

    $\left( \frac{V}{N_y} \right)^{1/N_Y}$    (90)

where $V$ is the volume of the hypercube. The upper bound of the kernel size can be set
proportionally to this value. During training, the kernel size can be set so as to ensure local
interaction subject to the upper bound.

Figure 63 shows an example result when entropy maximization is modeled as
diffusion. The upper bound on the kernel size was set to 1/3 of equation 90. Subject to the
upper bound the kernel size was adaptively set to 1/2 the maximum nearest neighbor
distance. One interesting observation is that near the center of the figure, the data (diamond
symbols) have arranged themselves in a hexagonal configuration, which is well known to
be the most efficient sampling scheme in two dimensions.
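
The kernel size rule described above can be sketched as follows, assuming that the nearest-point spacing for uniformly spaced points is $(V/N_y)^{1/N_Y}$ and using the 1/3 and 1/2 factors quoted in the text. The function names and sample point sets are illustrative.

```python
import math

def uniform_spacing(n_points, volume, n_dim):
    """Nearest-point spacing for n_points spread uniformly over a
    hypercube of the given volume in n_dim dimensions."""
    return (volume / n_points) ** (1.0 / n_dim)

def max_nn_distance(pts):
    """Largest nearest-neighbor distance over all points (pairwise scan)."""
    return max(min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
               for i, p in enumerate(pts))

def diffusion_kernel_size(pts, volume, bound_factor=1.0 / 3.0, nn_factor=0.5):
    """Kernel size: nn_factor times the maximum nearest-neighbor distance,
    capped at bound_factor times the uniform spacing."""
    upper = bound_factor * uniform_spacing(len(pts), volume, len(pts[0]))
    return min(upper, nn_factor * max_nn_distance(pts))

# 100 points on a regular 10 x 10 grid in the unit square: the cap applies.
grid = [(i / 10.0 + 0.05, j / 10.0 + 0.05) for i in range(10) for j in range(10)]
# Three tightly grouped points in the unit square: the adaptive rule applies.
dense = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01)]
print(diffusion_kernel_size(grid, 1.0))
print(diffusion_kernel_size(dense, 1.0))
```

The cap keeps the kernel from exceeding the scale of the eventual uniform spacing, while the adaptive rule keeps it matched to the current configuration of the points.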

5.9.3 Stopping Criterion

Figure 63 brings up one final subject in the local interaction viewpoint. The original
optimization criterion was the integrated squared error between the observed distribution
and the desired uniform distribution. Since the PDF estimation was bypassed, we no
longer have access to the criterion while training. Consequently, we need a proxy for the
criterion in order to determine when to stop the training. We propose the following
measure as a substitute

    $\frac{\max(\Delta_{NN}) - \min(\Delta_{NN})}{\max(\Delta)}$    (91)

where $\max(\Delta_{NN})$ is the maximum nearest neighbor distance (which we are already
keeping track of), $\min(\Delta_{NN})$ is the minimum nearest neighbor distance, and $\max(\Delta)$ is the
maximum distance between any two points. The numerator term measures how equally
spaced the points are and is expected to approach zero as the distribution becomes more
uniform. The denominator term is a penalty term for not filling the entire space. This
measure is shown for the previous diffusion example along with the integrated squared error
measure and entropy in figure 64. Both the integrated squared error and entropy measures
were computed from a sampled estimate of the observed PDF. The estimation locations
are represented by the cross symbols in figure 63.

Figure 63. Entropy maximization as diffusion. The data points are plotted as
diamonds in the figure above. PDF estimation locations are shown as
plus signs. We see that near the center of the distribution the points
have arranged themselves in a hexagonal configuration, known to be
the most efficient sampling scheme in two dimensions.

As we can see, both the integrated squared error measure and the measure of equation
91 are adequate estimates of the entropy. Equation 91, however, is much less
computationally expensive than the other two.
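
The distance-based measure of equation 91 can be computed from the pairwise distances alone. The sketch below is illustrative: the function name is hypothetical, and a brute-force scan over all pairs of points is assumed.

```python
import math

def stopping_measure(pts):
    """(max NN distance - min NN distance) / max pairwise distance."""
    nn = []
    largest = 0.0
    for i, p in enumerate(pts):
        dists = [math.dist(p, q) for j, q in enumerate(pts) if j != i]
        nn.append(min(dists))
        largest = max(largest, max(dists))
    return (max(nn) - min(nn)) / largest

# Evenly spaced points give a measure of zero; a clumped set does not.
even = [(float(i), float(j)) for i in range(4) for j in range(4)]
clumped = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
print(stopping_measure(even))     # 0.0
print(stopping_measure(clumped))
```

Since the pairwise distances are already needed to evaluate the kernels, the measure adds little work per iteration.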


[Figure 64 plot: stopping criterion versus iter (0 to 100), logarithmic vertical axis from 0.001 to 1.000]

Figure  64.  Stopping  criterion.  Comparison  of  entropy,  integrated  squared  error, 
and  distance  derived  stopping  criterion.  Integrated  squared  error  and 
the  distance  derived  criterion  are  reasonable  approximations  to  the 
criterion  of  interest,  entropy. 


5.10 Observations
We have described a nonparametric approach to information theoretic feature
extraction. We believe that this method can be used to improve classification performance by
directly choosing relevant features for classification via maximization of mutual
information. A critical capability of the information theoretic approach is the ability to adapt the
entropy of the output space of the nonlinear projection. We have shown that,
through the use of a simple differentiable estimator, namely Parzen windows, the
adaptation of entropy can fit logically into the error backpropagation model. This method
differs from other entropy based approaches such as using the Kullback-Leibler norm for
supervised learning.

We have also presented experiments that illustrate the usefulness of this technique.
Comparisons to the well known PCA method show that the information theoretic
approach is more sensitive to the underlying data structure beyond simple second-order
statistics. The data types used for the experiments were simple by design. They served to
illustrate the usefulness of the method even for seemingly simple problems.

We have also shown how the approach can be modeled as local interaction of the data
in the output space. This viewpoint led to a significant computational savings as well as a
clearer intuitive understanding of the algorithm.

5.11  Mutual Information Applied to the Nonlinear MACE Filters
At this point we present experimental results which illustrate application of this
technique to the nonlinear MACE filter. This is accomplished by repeating experiments III
(section 4.6.3) and IV (section 4.6.4) from the previous chapter.

In experiment III we trained the classifier (after pre-processing the imagery) with
subspace projected noise exemplars for the rejection class and Gram-Schmidt
orthogonalization on the input layer. The orthogonality constraint ensured that the features
would be uncorrelated over the rejection class. In this experiment we remove the
orthogonality constraint and decouple the feature extraction from the discriminant function
explicitly. The images are still pre-processed and the same exemplars are used to train the
system; however, the classifier architecture will be trained on the output of the feature
extractor.


The goal is to maximize mutual information conditioned on the recognition class, or

\[
\begin{aligned}
I(C, y) &= h(y) - h(y \mid C) \\
I(C, g(x, \alpha)) &= h(g(x, \alpha)) - h(g(x, \alpha) \mid C),
\end{aligned}
\tag{92}
\]

where $x$ is the pre-processed training exemplar and $y = g(x, \alpha)$ is the extracted
feature vector.
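To make the quantity in (92) concrete, the following is a minimal sketch of how $h(y)$ and $h(y \mid C)$ could be estimated from labeled feature vectors with a Gaussian Parzen window. It is an illustration only and is not the training procedure used in these experiments; the function names, the resubstitution estimator, the kernel width `sigma`, and the synthetic data are all assumptions.

```python
import numpy as np

def parzen_entropy(y, sigma):
    """Resubstitution estimate (in nats) of the differential entropy of the
    samples y, shape (n, d), using an isotropic Gaussian Parzen window."""
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    sq_dist = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_k = -sq_dist / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    # log of the Parzen density estimate at each sample (log-sum-exp for stability)
    m = log_k.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1)) - np.log(n)
    return -log_p.mean()

def mutual_information(y, labels, sigma):
    """I(C, y) = h(y) - h(y|C), with h(y|C) a prior-weighted sum over classes."""
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    h_y = parzen_entropy(y, sigma)
    h_y_given_c = 0.0
    for c in np.unique(labels):
        mask = labels == c
        h_y_given_c += mask.mean() * parzen_entropy(y[mask], sigma)
    return h_y - h_y_given_c

# Hypothetical 2-D features: well separated classes versus identical classes.
rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n)
separated = np.vstack([rng.normal(-5, 1, (n, 2)), rng.normal(5, 1, (n, 2))])
identical = rng.normal(0, 1, (2 * n, 2))
mi_separated = mutual_information(separated, labels, sigma=0.5)
mi_identical = mutual_information(identical, labels, sigma=0.5)
```

For two equally likely, well separated classes the estimate approaches $\ln 2$ nats, the entropy of the class label, while for identically distributed classes it stays near zero.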

The feature extraction architecture is an $N_1 N_2$-4-2 MLP (4 hidden nodes, 2 output
nodes). The resulting feature space mapping is shown in figure 65. In contrast to the
results of section 4.6, the feature space is the output of a nonlinear mapping, so it is
difficult to make other than qualitative comments about it. We can, however, say much
about the criterion from which it was derived (and we have). In this case we are left with a
performance comparison to the previous experiments. A summary of the results of this
section together with those of section 4.6 is given in table 4, where we can see that the
performance is comparable to (slightly better than) that of experiment III from the previous
chapter, which used the same noise class exemplars with the orthogonality constraint.

Table 4. Comparison of ROC classifier performance for two values of Pd. Results are
shown for the linear filter versus experiments III and IV from section 4.6 and mutual
information feature extraction. The symbols indicate the type of rejection class
exemplars used. N: white noise training, G-S: Gram-Schmidt orthogonalization,
subN: PCA subspace noise, C-H: convex hull rejection class.

                                        Pfa (%)
         ---------------------------------------------------------------------------
         linear         section 4.6                      mutual information
Pd (%)   filter    (subN, G-S)   (subN, G-S, C-H)      (subN)    (subN, C-H)
  80      4.37        2.81            2.45              2.65        1.36
  99     42.43       26.52           15.33             23.09       11.07
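As an illustration of how an entry of this kind is typically obtained from classifier outputs, the sketch below picks the highest threshold that still detects the required fraction of recognition class exemplars and reports the resulting false alarm rate. The function name and the scores are hypothetical, and this is not the evaluation code behind table 4.

```python
import numpy as np

def pfa_at_pd(target_scores, clutter_scores, pd):
    """Return (P_fa, threshold) at the highest threshold whose detection
    rate on the target scores is at least pd."""
    t = np.sort(np.asarray(target_scores, dtype=float))[::-1]  # descending
    c = np.asarray(clutter_scores, dtype=float)
    # number of targets that must be detected (small guard against round-off)
    k = max(int(np.ceil(pd * t.size - 1e-12)), 1)
    threshold = t[k - 1]
    return float(np.mean(c >= threshold)), float(threshold)

# Hypothetical scores, for illustration only.
targets = [0.9, 0.8, 0.7, 0.6, 0.5]
clutter = [0.65, 0.3, 0.2, 0.1]
pfa, thr = pfa_at_pd(targets, clutter, pd=0.8)
```

With these made-up scores, four of the five targets must be detected, so the threshold falls at 0.6 and one of the four clutter scores exceeds it.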

The resulting ROC curve is shown in figure 66 as compared to the linear MACE filter.
It is not surprising that the performance did not substantially exceed that of experiment III
when we consider how the rejection class was generated: as a random projection of
Gaussian noise onto an orthonormal basis. As we have shown in previous experiments, under
the Gaussian condition (equal covariances), orthogonality and entropy are equivalent.
These results then give support to this technique, since orthogonality was not enforced on
the feature extractor.




Figure 65. Mutual information feature space. Rejection class is represented with
sub-space noise images. The top figure shows the training exemplars
(plus signs are recognition and triangles are rejection), while the bottom
figure shows the testing set.


In the second experiment we repeat the conditions of experiment IV. In this case we
represent the rejection class with both subspace noise and convex hull exemplars. The
Gaussian assumption is no longer correct due to the inclusion of the convex hull
exemplars. In this case we would expect the results to improve, on the assumption that there
is information to be extracted from the convex hull exemplars with regard to
classification. As we have already demonstrated in the previous chapter that the convex hull
approach yielded improved classification, we will hold this assumption to be true. The feature
space for this case is shown in figure 67. We observe that the feature space, as in the
previous chapter, is quite different. Intuitively the result makes sense in the context of mutual
information. Recall that the convex hull exemplars lie in the interior of the recognition class
in the input space (due to their construction) and that our goal via mutual information is to
make the recognition class compact and the rejection class diffuse. This goal and the
property of the convex hull are seemingly at odds, so a trade-off results: the recognition class
is compact on an ellipsoid, with the convex hull exemplars on the interior, but expanded.

We see from table 4 that the classification results are substantially better than the
previous results. The ROC curve for this experiment is shown in figure 68 as compared to the
linear system. We would hypothesize that mutual information performed better at
extracting discriminating information from the exemplars. We also note that neither result
relied on orthogonality in the input layer, for which we can only make second-order
statistical justifications, and yet both achieved the same or better performance.


Figure  66.  ROC  curves  for  mutual  information  feature  extraction  (dotted  line) 
versus  linear  MACE  filter  (solid  line). 




Figure  67.  Mutual  information  feature  space  resulting  from  convex  hull 
exemplars.  The  training  exemplars  are  shown  in  the  top  figure  (square  - 
recognition,  triangle  -  rejection).  The  bottom  figure  shows  the  testing 
exemplars. 




Figure  68.  ROC  curves  for  mutual  information  feature  extraction  (dotted  line) 
versus  linear  MACE  filter  (solid  line). 


CHAPTER  6 
CONCLUSIONS 

We have discussed a methodology by which linear distortion invariant filtering can be
extended to nonlinear systems. The extension to nonlinear systems was initiated by first
establishing the link between distortion invariant filters and the linear associative memory
in chapter 3. The linear associative memory perspective is important in that it more closely
aligns distortion invariant filtering with classification. Advances in distortion invariant
filtering, as described in chapter 2, have occurred within a linear systems framework despite
the primary application being classification. The result is a classification approach which
considers only second order statistics. In contrast, the development of associative
memories has occurred within a probabilistic framework emphasizing a classification approach
which considers the underlying probability density function. This perspective led naturally
to nonlinear signal processing. The consequence of using the MSE criterion was also
discussed in chapter 3. The result, which has been shown by other researchers as well, was
that the MSE criterion combined with a universal approximator and 1-of-N coding (for an
N-class classification problem, the desired output is an N-vector with the element for the
ith class set to unity and all others to zero) is suitable for estimating posterior
class probabilities.

Some of the major contributions of this dissertation were presented in chapter 4. We
began with an analysis of commonly used measures of generalization for distortion
invariant filters. Our analysis showed that these measures were actually counter to good
classification performance. It is our opinion that the generalization measures discussed are more
properly suited to a signal representation framework and not classification. The analysis
also revealed that emphasis on the MACE filter optimization criterion in the construction
of the OTSDF led to superior classification performance.

The results of the analysis of generalization measures were significant in that they
highlighted the fact that commonly used measures of generalization should not be the sole
basis upon which to compare nonlinear systems to their linear counterparts, since these
measures are only weakly coupled to classification performance.

The probabilistic viewpoint of the MACE filter optimization criterion was presented
in chapter 4 as well. Within this framework, nonlinear mappings such as the multi-layer
perceptron were included, allowing for a nonlinear extension of the MACE filter. The lack
of closed form analytical solutions for general nonlinear mappings necessitated an
iterative approach. Consequently, the feed-forward multi-layer perceptron was an obvious
candidate for the nonlinear mapping due to its property as a universal function approximator
coupled with computationally efficient iterative algorithms. This choice also preserved the
shift invariance property of the original linear filter.

Several developments resulted from the nonlinear approach. An efficient training
algorithm resulted from the recognition that the optimization criterion was equivalent to
characterizing the rejection class by white-noise images in the pre-whitened image space.
The results of experiment I in section 4.6.1 emphasized the need for suitable performance
measures by which to compare nonlinear and linear classifiers. This motivated a feature
space viewpoint of the internal mappings of the multi-layer perceptron. Examination of
the feature space led to several modifications of the training algorithm and subsequent
improvements in classification performance. Specifically, an orthogonality
constraint on the input layer of the multi-layer perceptron was sufficient to guarantee
uncorrelated features over the rejection class. Projection of the white noise rejection class
exemplars onto the space of the recognition class data matrix effectively reduced the
dimensionality of the problem from $N_1 N_2$ (the image size) to $N_t$ (the number of
recognition class exemplars). The result of this modification was a significantly faster
convergence rate. The last result borrowed the idea of using the interior of the convex hull (over
the recognition class exemplars) as representative of the rejection class. This is a further
refinement of the concept of reducing the intrinsic dimensionality of the problem. There were
two observations concerning the convex hull approach: convergence times were
considerably longer, and the stability of the iterative procedure became an issue. However, when
the training did converge, the classification performance was superior to the previous
cases. We feel that the results of the convex hull approach merit further investigation.

In chapter 5 we presented a new information theoretic feature extraction method. We
provided a clear motivation (Fano's inequality) for using mutual information as the
criterion for feature extraction in a classification framework. It is our opinion that this new
method represents a significant advance to the state of the art for self-organizing systems
and information theoretic signal processing in several regards. It utilizes the continuous
form of entropy and mutual information as opposed to the discrete form; consequently it
can be used for continuous mappings. In contrast to previous entropy based approaches it
poses no limitation on the number of hidden layers of the network mapping. Also, it does
not require the underlying pdf to be unimodal, again in contrast to previous approaches.
These qualities make it a very powerful method for information theoretic signal
processing. As such this method has wide potential application beyond nonlinear extensions to the
MACE filter.

A significant result of chapter 5 was the demonstration that a global property of a
mapping, namely information, could be modeled very simply by local interaction of the data in
the output space, significantly reducing the computational complexity of the algorithm.

We also demonstrated how this method could be applied to the MACE filter such that
statistically independent features rather than uncorrelated features could be extracted over
the rejection class.

In the course of the discussion we presented results with respect to ISAR data. The
data chosen represents, in our opinion, a fairly difficult classification problem in the sense
that the range of distortions for the ISAR data includes not only rotation in aspect but
also modifications in the vehicle configuration and differences in the radar depression angle.
In spite of these obstacles the nonlinear system generalizes quite well.

It is our opinion that the results of this research represent a contribution to the state of
the art in the areas of automatic target recognition and, by extension, pattern recognition,
as well as information theoretic signal processing. We also feel that this research has
established a basis for a continued line of research. In particular the discussion of chapter 5 is
of interest to a wider audience than the automatic target recognition community. Signal
processing problems such as blind source separation, independent component analysis, and
parameter estimation represent potential applications of the technique. Another topic we have
mentioned is the relationship of this approach to the self-organizing feature map (SOFM) of
Kohonen. These areas of application will be pursued in the future.


APPENDIX  A 
DERIVATIONS 

A.1  Frequency Domain Relationships
The following derivations show frequency domain relationships for unitary discrete
Fourier transformations. The results are shown for one-dimensional vectors, but can be
easily extended to multiple dimensions. The autocorrelation sequence of a complex,
wide-sense stationary process $x(n)$ is defined as

\[
R_x(m) = E(x^*(n)\,x(n+m))
\]

and can be estimated from $N$ observations of the process (assuming the sequence is
ergodic) as

\[
\hat{R}_x(m) = \frac{1}{N}\sum_{n=0}^{N-1} x^*(n)\,x(n+m). \tag{93}
\]

A relationship can be derived between the estimated autocorrelation sequence and the
DFT of the observed sequence using the unitary DFT

\[
X(k) = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x(n)\,e^{-j2\pi kn/N}, \qquad
x(n) = \frac{1}{\sqrt{N}}\sum_{k=0}^{N-1} X(k)\,e^{j2\pi kn/N}.
\]


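Substituting $x(n)$ with its DFT in (93) yields

\[
\begin{aligned}
\hat{R}_x(m)
&= \frac{1}{N}\sum_{n=0}^{N-1}
   \left(\frac{1}{\sqrt{N}}\sum_{k=0}^{N-1} X(k)\,e^{j2\pi kn/N}\right)^{*}
   \left(\frac{1}{\sqrt{N}}\sum_{l=0}^{N-1} X(l)\,e^{j2\pi l(n+m)/N}\right) \\
&= \frac{1}{N^2}\sum_{n=0}^{N-1}\sum_{k=0}^{N-1}\sum_{l=0}^{N-1}
   X^*(k)X(l)\,e^{-j2\pi kn/N}\,e^{j2\pi l(n+m)/N} \\
&= \frac{1}{N^2}\sum_{k=0}^{N-1}\sum_{l=0}^{N-1}
   X^*(k)X(l)\,e^{j2\pi lm/N}\sum_{n=0}^{N-1} e^{j2\pi(l-k)n/N} \\
&= \frac{1}{N^2}\sum_{k=0}^{N-1}\sum_{l=0}^{N-1}
   X^*(k)X(l)\,e^{j2\pi lm/N}
   \begin{cases} N, & k = l \\ 0, & k \neq l \end{cases} \\
&= \frac{1}{N}\sum_{k=0}^{N-1} |X(k)|^2\,e^{j2\pi km/N},
\end{aligned}
\]

which is the inverse DFT of the periodogram of the observed sequence scaled by a factor
$1/\sqrt{N}$. The unitary DFT can also be represented by matrix operations as

\[
X = \Phi x, \qquad x = \Phi^\dagger X, \qquad
\phi_n(k) = \exp(-j2\pi kn/N)/\sqrt{N}, \tag{94}
\]
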
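where $\{x, X, \phi_n\} \in \mathbb{C}^{N \times 1}$ are complex column vectors. The DFT
relationship between the estimated autocorrelation sequence $\hat{R}_x(m)$ and the
periodogram of the observed sequence, $P_x(k) = |X(k)|^2$, is written

\[
\begin{bmatrix} \hat{R}_x(0) \\ \vdots \\ \hat{R}_x(N-1) \end{bmatrix}
= \frac{1}{\sqrt{N}}\,\Phi^\dagger
\begin{bmatrix} P_x(0) \\ \vdots \\ P_x(N-1) \end{bmatrix},
\qquad
\begin{bmatrix} P_x(0) \\ \vdots \\ P_x(N-1) \end{bmatrix}
= \sqrt{N}\,\Phi
\begin{bmatrix} \hat{R}_x(0) \\ \vdots \\ \hat{R}_x(N-1) \end{bmatrix}.
\]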
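The relationship between the autocorrelation estimate and the periodogram can be checked numerically. The sketch below assumes that the index $n + m$ in (93) is taken modulo $N$ (a periodic extension of the observed sequence, which the DFT representation implies) and uses NumPy's `norm="ortho"` option as the unitary DFT; the sequence itself is random test data.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Circular autocorrelation estimate: (1/N) * sum_n conj(x(n)) * x((n + m) mod N)
R_hat = np.array([np.mean(np.conj(x) * np.roll(x, -m)) for m in range(N)])

# Unitary DFT of the sequence and its periodogram P_x(k) = |X(k)|^2
X = np.fft.fft(x, norm="ortho")
P = np.abs(X) ** 2

# P = sqrt(N) * Phi * R_hat   and   R_hat = (1/sqrt(N)) * Phi^H * P
P_from_R = np.sqrt(N) * np.fft.fft(R_hat, norm="ortho")
R_from_P = np.fft.ifft(P, norm="ortho") / np.sqrt(N)
```

Both reconstructions agree with the directly computed quantities to machine precision, and `P_from_R` has a negligible imaginary part.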


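The covariance matrix of a zero-mean, complex, wide-sense stationary process $x(n)$ can
be estimated from $N$ observations of the process as

\[
\begin{aligned}
\Sigma_x &=
\begin{bmatrix}
E(x^*(n)x(n)) & E(x^*(n+1)x(n)) & \cdots & E(x^*(n+N-1)x(n)) \\
E(x^*(n)x(n+1)) & E(x^*(n+1)x(n+1)) & \cdots & E(x^*(n+N-1)x(n+1)) \\
\vdots & \vdots & \ddots & \vdots \\
E(x^*(n)x(n+N-1)) & E(x^*(n+1)x(n+N-1)) & \cdots & E(x^*(n+N-1)x(n+N-1))
\end{bmatrix} \\
&=
\begin{bmatrix}
R_x(0) & R_x(-1) & \cdots & R_x(-N+1) \\
R_x(1) & R_x(0) & \cdots & R_x(-N+2) \\
\vdots & \vdots & \ddots & \vdots \\
R_x(N-1) & R_x(N-2) & \cdots & R_x(0)
\end{bmatrix}.
\end{aligned}
\]

Replacing the elements of the covariance matrix with the autocorrelation sequence
estimates and applying the unitary DFT matrices yields

\[
\Phi \hat{\Sigma}_x \Phi^\dagger = \Phi
\begin{bmatrix}
\hat{R}_x(0) & \hat{R}_x(-1) & \cdots & \hat{R}_x(-N+1) \\
\hat{R}_x(1) & \hat{R}_x(0) & \cdots & \hat{R}_x(-N+2) \\
\vdots & \vdots & \ddots & \vdots \\
\hat{R}_x(N-1) & \hat{R}_x(N-2) & \cdots & \hat{R}_x(0)
\end{bmatrix}
\Phi^\dagger.
\]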

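Using the DFT relationship of (94) and the DFT properties
$x(n-m) \Leftrightarrow \sqrt{N}\,\phi_m(k)X(k)$ and $\phi_n(k) = \phi_k(n)$ yields

\[
\begin{aligned}
\Phi \hat{\Sigma}_x \Phi^\dagger
&= \begin{bmatrix}
P_x(0)\phi_0(0) & P_x(0)\phi_1(0) & \cdots & P_x(0)\phi_{N-1}(0) \\
P_x(1)\phi_0(1) & P_x(1)\phi_1(1) & \cdots & P_x(1)\phi_{N-1}(1) \\
\vdots & \vdots & \ddots & \vdots \\
P_x(N-1)\phi_0(N-1) & P_x(N-1)\phi_1(N-1) & \cdots & P_x(N-1)\phi_{N-1}(N-1)
\end{bmatrix} \Phi^\dagger \\
&= \begin{bmatrix}
P_x(0) & 0 & \cdots & 0 \\
0 & P_x(1) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P_x(N-1)
\end{bmatrix}
\begin{bmatrix}
\phi_0(0) & \phi_1(0) & \cdots & \phi_{N-1}(0) \\
\phi_0(1) & \phi_1(1) & \cdots & \phi_{N-1}(1) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(N-1) & \phi_1(N-1) & \cdots & \phi_{N-1}(N-1)
\end{bmatrix} \Phi^\dagger \\
&= P_x
\begin{bmatrix}
\phi_0(0) & \phi_0(1) & \cdots & \phi_0(N-1) \\
\phi_1(0) & \phi_1(1) & \cdots & \phi_1(N-1) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_{N-1}(0) & \phi_{N-1}(1) & \cdots & \phi_{N-1}(N-1)
\end{bmatrix} \Phi^\dagger \\
&= P_x \Phi \Phi^\dagger \\
&= P_x,
\end{aligned}
\tag{95}
\]

where $P_x$ is a diagonal matrix whose diagonal elements contain the periodogram of the
observed sequence $x(n)$.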
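The diagonalization in (95) can be verified with a small numerical sketch. As an assumption, the autocorrelation estimates are computed circularly (indices modulo $N$), so that the estimated covariance matrix is circulant; the explicit matrix `Phi` is built from the element definition in (94), and the test sequence is random.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Circular autocorrelation estimates R_hat(m), m = 0..N-1
R_hat = np.array([np.mean(np.conj(x) * np.roll(x, -m)) for m in range(N)])

# Estimated covariance matrix with entries Sigma[i, j] = R_hat((i - j) mod N)
idx = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N
Sigma_hat = R_hat[idx]

# Unitary DFT matrix with elements exp(-j 2 pi k n / N) / sqrt(N)
k = np.arange(N)
Phi = np.exp(-2j * np.pi * np.outer(k, k) / N) / np.sqrt(N)

D = Phi @ Sigma_hat @ Phi.conj().T      # should equal diag(P_x)
P = np.abs(Phi @ x) ** 2                # periodogram |X(k)|^2
```

The off-diagonal elements of `D` vanish to machine precision and its diagonal matches the periodogram `P`.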

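The output variance of an FIR filter $h \in \Re^{N \times 1}$ due to a zero-mean, complex,
wide-sense-stationary random noise sequence input $n \in \mathbb{C}^{N \times 1}$ is

\[
\begin{aligned}
\sigma_n^2 &= E[\,|h^\dagger n|^2\,] \\
&= E[(h^\dagger n)(n^\dagger h)] \\
&= h^\dagger E[n n^\dagger]\, h.
\end{aligned}
\tag{96}
\]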

Insertion of the unitary DFT transformation matrices into equation (96), using the
definitions of (94) and the identity of (95), yields the frequency domain relationship

\[
\sigma_n^2 = H^\dagger P_n H = h^\dagger \Sigma_n h, \tag{97}
\]

where $H = \Phi h$ and $P_n = \Phi \Sigma_n \Phi^\dagger$.

A.2  Optimal Trade-off of Noise Response with Error Variance Subject to Zero-Mean
Error Constraint

Suppose we wish to relax the equality constraints with regard to the desired outputs.
That is, we no longer require that

\[
x^\dagger h = d \tag{98}
\]

and instead allow the output response at the locations of the former equality constraints to
vary with zero mean. We can trade off the noise response of the filter against the
error variance as follows

\[
\min_h \; \beta h^\dagger h + (1-\beta)(x^\dagger h - d)^\dagger (x^\dagger h - d)
\]

subject to the constraint

\[
[1 \ldots 1](x^\dagger h - d) = 0.
\]

Adjoining the zero-mean equality constraint to the optimization criterion yields

\[
\begin{aligned}
J &= \beta h^\dagger h + (1-\beta)(x^\dagger h - d)^\dagger (x^\dagger h - d)
     + \lambda [1 \ldots 1](x^\dagger h - d) \\
  &= \beta h^\dagger h
     + (1-\beta)\left(h^\dagger x x^\dagger h - (xd)^\dagger h - h^\dagger x d + d^\dagger d\right)
     + \lambda [1 \ldots 1](x^\dagger h - d).
\end{aligned}
\tag{99}
\]
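Computing the gradient of (99) with respect to $h$ yields

\[
\frac{\partial J}{\partial h}
= 2\beta h + 2(1-\beta)\,x x^\dagger h - 2(1-\beta)\,x d + \lambda\, x [1 \ldots 1]^\dagger. \tag{100}
\]

Setting (100) to zero and solving for $h$ yields

\[
h = \left(\beta I + (1-\beta)\,x x^\dagger\right)^{-1}
    \left((1-\beta)\,x d - \frac{\lambda}{2}\, x [1 \ldots 1]^\dagger\right). \tag{101}
\]

Substituting (101) into the zero-mean constraint equation yields the condition

\[
[1 \ldots 1]\, x^\dagger \left(\beta I + (1-\beta)\,x x^\dagger\right)^{-1}
\left((1-\beta)\,x d - \frac{\lambda}{2}\, x [1 \ldots 1]^\dagger\right)
= [1 \ldots 1]\, d. \tag{102}
\]

For the special case of $d = [1 \ldots 1]^\dagger$ we can further simplify (102), yielding

\[
[1 \ldots 1]\, x^\dagger \left(\beta I + (1-\beta)\,x x^\dagger\right)^{-1} x [1 \ldots 1]^\dagger
\left(1 - \beta - \frac{\lambda}{2}\right) = N_t. \tag{103}
\]

Letting $x [1/N_t \ldots 1/N_t]^\dagger = \bar{x}$ and factoring out the scalar term
$(1 - \beta - \lambda/2)$ yields

\[
\left(1 - \beta - \frac{\lambda}{2}\right)
= \frac{1}{N_t}\left(\bar{x}^\dagger \left(\beta I + (1-\beta)\,x x^\dagger\right)^{-1} \bar{x}\right)^{-1}. \tag{104}
\]
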


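Substituting (104) into (101) yields the solution

\[
h = \frac{\left(\beta I + (1-\beta)\,x x^\dagger\right)^{-1} \bar{x}}
         {\bar{x}^\dagger \left(\beta I + (1-\beta)\,x x^\dagger\right)^{-1} \bar{x}}. \tag{105}
\]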
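A short numerical sketch can confirm that this closed form satisfies the zero-mean constraint and is a constrained minimizer of the trade-off criterion. The example assumes real-valued data (so that $\dagger$ reduces to transpose), $d = [1 \ldots 1]^\dagger$, and arbitrary random exemplars; the sizes and the value of $\beta$ are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nt = 32, 5                      # filter length and number of exemplars (illustrative)
beta = 0.3
x = rng.standard_normal((N, Nt))   # columns are the training exemplars
d = np.ones(Nt)                    # desired outputs, the special case d = [1 ... 1]

A = beta * np.eye(N) + (1 - beta) * (x @ x.T)
x_bar = x.mean(axis=1)             # x [1/Nt ... 1/Nt]^T
A_inv_x_bar = np.linalg.solve(A, x_bar)
h = A_inv_x_bar / (x_bar @ A_inv_x_bar)

def cost(h_vec):
    """Trade-off criterion: beta h'h + (1 - beta) ||x'h - d||^2."""
    e = x.T @ h_vec - d
    return beta * (h_vec @ h_vec) + (1 - beta) * (e @ e)

mean_error = np.mean(x.T @ h - d)  # zero-mean constraint on the output error
```

Any perturbation of `h` that keeps the mean error at zero (that is, any direction orthogonal to `x_bar`) increases `cost`, which is consistent with `h` being the constrained optimum.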


A.3  Convolution of Gaussian Kernel with its Gradient
The $N$-dimensional Gaussian kernel takes the form

\[
\kappa(u) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}
            \exp\left(-\frac{1}{2} u^T \Sigma^{-1} u\right), \tag{106}
\]


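where $u \in \Re^{N \times 1}$ and the covariance term, $\Sigma \in \Re^{N \times N}$, is a
positive definite matrix. The gradient of the kernel with respect to $u$ has the form
[Fukunaga, 1990]

\[
\begin{aligned}
\kappa'(u) = \frac{\partial \kappa}{\partial u}
&= -\kappa(u)\,\Sigma^{-1} u \\
&= -\frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}
   \exp\left(-\frac{1}{2} u^T \Sigma^{-1} u\right) \Sigma^{-1} u.
\end{aligned}
\tag{107}
\]

Here we are interested in these functions when the covariance term has the simplified
form

\[
\Sigma = \sigma^2 I =
\begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix},
\]
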

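in which case equations (106) and (107) become

\[
\kappa(u) = \frac{1}{(2\pi)^{N/2}\sigma^{N}}
            \exp\left(-\frac{1}{2\sigma^2} u^T u\right) \tag{108}
\]

and

\[
\kappa'(u) = -\frac{1}{(2\pi)^{N/2}\sigma^{N+2}}
             \exp\left(-\frac{1}{2\sigma^2} u^T u\right) u. \tag{109}
\]

The convolution of these terms, $\kappa_a(u)$, is computed as follows

\[
\begin{aligned}
\kappa_a(u) &= \kappa(u) * \kappa'(u) \\
&= \int \kappa(u - x)\,\kappa'(x)\,dx \\
&= -\frac{1}{(2\pi)^{N}\sigma^{2N+2}}
   \int \exp\left(-\frac{1}{2\sigma^2}\left((u-x)^T(u-x) + x^T x\right)\right) x\,dx.
\end{aligned}
\tag{110}
\]
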

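Examination of equation 110 reveals that it contains a vector term. The convolution of the
scalar kernel expression with the vector gradient expression is carried out with each element
of the gradient vector. Substituting the elements of the vectors $u$ and $x$ into equation 110,
where

\[
u = [u_1, \ldots, u_N]^T
\]

and

\[
x = [x_1, \ldots, x_N]^T,
\]

we can rewrite the $j$th element of the vector integral as

\[
\begin{aligned}
\kappa_a(u)_j
&= -\frac{\exp\left(-\frac{1}{2\sigma^2} u^T u\right)}{(2\pi)^{N}\sigma^{2N+2}}
   \int \exp\left(-\frac{1}{\sigma^2}\sum_{i=1}^{N}\left(x_i^2 - x_i u_i\right)\right) x_j\,dx \\
&= -\frac{\exp\left(-\frac{1}{2\sigma^2} u^T u\right)}{(2\pi)^{N}\sigma^{2N+2}}
   \left[\prod_{i \neq j}\int \exp\left(-\frac{1}{\sigma^2}\left(x_i^2 - x_i u_i\right)\right) dx_i\right]
   \int \exp\left(-\frac{1}{\sigma^2}\left(x_j^2 - x_j u_j\right)\right) x_j\,dx_j \\
&= -\frac{\exp\left(-\frac{1}{2\sigma^2} u^T u\right)}{(2\pi)^{N}\sigma^{2N+2}}
   \left[\prod_{i \neq j} \sigma\sqrt{\pi}\,\exp\left(\frac{u_i^2}{4\sigma^2}\right)\right]
   \frac{\sigma\sqrt{\pi}}{2}\,u_j \exp\left(\frac{u_j^2}{4\sigma^2}\right),
\end{aligned}
\tag{111}
\]

which, as our final result, can be converted back to vector form as

\[
\kappa_a(u) = \kappa(u) * \kappa'(u)
= -\frac{1}{2^{N+1}\pi^{N/2}\sigma^{N+2}}
  \exp\left(-\frac{1}{4\sigma^2} u^T u\right) u. \tag{112}
\]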
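The closed form for the convolution of the Gaussian kernel with its gradient can be checked numerically. The sketch below treats only the one-dimensional case ($N = 1$) and approximates the convolution integral with a Riemann sum on a fine grid; the kernel width, grid, and test points are arbitrary choices made for illustration.

```python
import numpy as np

sigma = 0.7
x = np.linspace(-12.0, 12.0, 4801)   # integration grid, wide enough for the tails
dx = x[1] - x[0]

def kernel(u, s):
    """One-dimensional Gaussian kernel of width s."""
    return np.exp(-u**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

def kernel_grad(u, s):
    """Derivative of the one-dimensional Gaussian kernel."""
    return -u / s**2 * kernel(u, s)

def kappa_a_closed(u, s, n=1):
    """Closed form -u exp(-u^2 / (4 s^2)) / (2^(n+1) pi^(n/2) s^(n+2)) for n = 1."""
    return -u * np.exp(-u**2 / (4 * s**2)) / (2 ** (n + 1) * np.pi ** (n / 2) * s ** (n + 2))

u_test = np.array([-1.5, -0.3, 0.0, 0.4, 2.0])
numeric = np.array([np.sum(kernel(u - x, sigma) * kernel_grad(x, sigma)) * dx for u in u_test])
closed = kappa_a_closed(u_test, sigma)
```

The numerical and closed-form values agree closely, as expected, since convolving a Gaussian of variance $\sigma^2$ with the gradient of another gives the gradient of a Gaussian of variance $2\sigma^2$.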


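Also of interest is the gradient (specifically the magnitude) of equation 112, which has
the form

\[
\frac{\partial \kappa_a(u)}{\partial u}
= -\frac{1}{2^{N+1}\pi^{N/2}\sigma^{N+2}}
  \exp\left(-\frac{1}{4\sigma^2} u^T u\right)
  \left(I - \frac{1}{2\sigma^2} u u^T\right). \tag{113}
\]

The magnitude of the gradient is

\[
\left|\frac{\partial \kappa_a(u)}{\partial u}\right|
= \frac{1}{2^{N+1}\pi^{N/2}\sigma^{N+2}}
  \exp\left(-\frac{1}{4\sigma^2} u^T u\right)
  \left|1 - \frac{1}{2\sigma^2} u^T u\right|. \tag{114}
\]

Evaluation of the magnitude at $|u| = 0$ gives

\[
\left|\frac{\partial \kappa_a(u)}{\partial u}\right|_{|u| = 0}
= \frac{1}{2^{N+1}\pi^{N/2}\sigma^{N+2}}. \tag{115}
\]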
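The matrix form of the gradient in (113) can be compared against central finite differences of (112). This is a minimal sketch for $N = 2$; the function names, the kernel width, and the evaluation point are assumptions chosen for illustration.

```python
import numpy as np

def kappa_a(u, sigma):
    """Closed-form convolution of the Gaussian kernel with its gradient."""
    n = u.size
    c = 1.0 / (2 ** (n + 1) * np.pi ** (n / 2) * sigma ** (n + 2))
    return -c * np.exp(-(u @ u) / (4 * sigma**2)) * u

def kappa_a_jacobian(u, sigma):
    """Analytic gradient (Jacobian matrix) of kappa_a with respect to u."""
    n = u.size
    c = 1.0 / (2 ** (n + 1) * np.pi ** (n / 2) * sigma ** (n + 2))
    return -c * np.exp(-(u @ u) / (4 * sigma**2)) * (np.eye(n) - np.outer(u, u) / (2 * sigma**2))

def numerical_jacobian(f, u, eps=1e-6):
    """Central finite-difference Jacobian; column i holds d f / d u_i."""
    n = u.size
    jac = np.empty((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        jac[:, i] = (f(u + step) - f(u - step)) / (2 * eps)
    return jac

sigma = 0.6
u0 = np.array([0.3, -0.8])
analytic = kappa_a_jacobian(u0, sigma)
numeric = numerical_jacobian(lambda v: kappa_a(v, sigma), u0)
at_origin = kappa_a_jacobian(np.zeros(2), sigma)
```

At the origin the Jacobian reduces to a scaled identity whose scale factor is the constant appearing in (115).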

A.4  Convolution of the Uniform Distribution Function with the Gradient of the
Gaussian Kernel

The uniform distribution function has the following form

\[
f_U(u) =
\begin{cases}
\dfrac{1}{\prod_i (b_i - a_i)}, & a_i \le u_i \le b_i,\ \forall i \\
0, & \text{otherwise.}
\end{cases}
\tag{116}
\]

The convolution of the uniform distribution function with the gradient of the Gaussian
kernel can be written as

\[
\begin{aligned}
f_r(u) &= f_U(u) * \kappa'(u) \\
&= \int_{\Omega_U} f_U(x)\,\kappa'(u - x)\,dx,
\end{aligned}
\tag{117}
\]

which is an $N$-fold integral over the region of support, $\Omega_U$, of the uniform
distribution function. We are interested in the result of this vector integration when the
kernel gradient term has the same form as in the previous section and the uniform
distribution function is such that

\[
b_i = -a_i = \frac{a}{2}.
\]

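With these conditions, equation 117 can be rewritten

\[
\begin{aligned}
f_r(u) &= \int f_U(x)\,\kappa'(u - x)\,dx \\
&= -\frac{1}{a^N}\int_{-a/2}^{a/2}\!\!\cdots\!\int_{-a/2}^{a/2}
   \frac{1}{(2\pi)^{N/2}\sigma^{N+2}}
   \exp\left(-\frac{1}{2\sigma^2}(u-x)^T(u-x)\right)(u - x)\,dx \\
&= -\frac{1}{a^N (2\pi)^{N/2}\sigma^{N+2}}
   \int_{-a/2}^{a/2}\!\!\cdots\!\int_{-a/2}^{a/2}
   \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(u_i - x_i)^2\right)(u - x)\,dx.
\end{aligned}
\tag{118}
\]

The $j$th element of the vector integration can be written

\[
\begin{aligned}
f_r(u)_j
&= -\frac{1}{a^N (2\pi)^{N/2}\sigma^{N+2}}
   \left[\prod_{i \neq j}\int_{-a/2}^{a/2}
   \exp\left(-\frac{(u_i - x_i)^2}{2\sigma^2}\right) dx_i\right]
   \int_{-a/2}^{a/2}\exp\left(-\frac{(u_j - x_j)^2}{2\sigma^2}\right)(u_j - x_j)\,dx_j \\
&= -\frac{1}{a^N (2\pi)^{N/2}\sigma^{N+2}}
   \left[\prod_{i \neq j}\sigma\sqrt{\frac{\pi}{2}}
   \left(\mathrm{erf}\left(\frac{u_i + a/2}{\sqrt{2}\sigma}\right)
       - \mathrm{erf}\left(\frac{u_i - a/2}{\sqrt{2}\sigma}\right)\right)\right] \\
&\qquad \times\ \sigma^2\left(\exp\left(-\frac{(u_j - a/2)^2}{2\sigma^2}\right)
       - \exp\left(-\frac{(u_j + a/2)^2}{2\sigma^2}\right)\right) \\
&= -\frac{1}{a^N}
   \left[\prod_{i \neq j}\frac{1}{2}
   \left(\mathrm{erf}\left(\frac{u_i + a/2}{\sqrt{2}\sigma}\right)
       - \mathrm{erf}\left(\frac{u_i - a/2}{\sqrt{2}\sigma}\right)\right)\right]
   \left(\kappa_1\!\left(u_j - \frac{a}{2}, \sigma\right)
       - \kappa_1\!\left(u_j + \frac{a}{2}, \sigma\right)\right),
\end{aligned}
\tag{119}
\]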


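where $\kappa_1(u, \sigma)$ is the one-dimensional Gaussian kernel of width $\sigma$. The
vector result of the convolution is then written

\[
f_r(u) = \frac{1}{a^N}
\begin{bmatrix}
\displaystyle\prod_{i \neq 1}\frac{1}{2}
\left(\mathrm{erf}\left(\frac{u_i + a/2}{\sqrt{2}\sigma}\right)
    - \mathrm{erf}\left(\frac{u_i - a/2}{\sqrt{2}\sigma}\right)\right)
\left(\kappa_1\!\left(u_1 + \frac{a}{2}, \sigma\right)
    - \kappa_1\!\left(u_1 - \frac{a}{2}, \sigma\right)\right) \\
\vdots \\
\displaystyle\prod_{i \neq N}\frac{1}{2}
\left(\mathrm{erf}\left(\frac{u_i + a/2}{\sqrt{2}\sigma}\right)
    - \mathrm{erf}\left(\frac{u_i - a/2}{\sqrt{2}\sigma}\right)\right)
\left(\kappa_1\!\left(u_N + \frac{a}{2}, \sigma\right)
    - \kappa_1\!\left(u_N - \frac{a}{2}, \sigma\right)\right)
\end{bmatrix}.
\tag{120}
\]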
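As a sanity check on the closed form, the sketch below compares it with a direct midpoint-rule integration of the convolution in two dimensions. The function names, the values of `a` and `s`, the evaluation point, and the grid resolution are all assumptions chosen for illustration.

```python
import numpy as np
from math import erf, sqrt, pi

def k1(u, s):
    """One-dimensional Gaussian kernel of width s."""
    return np.exp(-u**2 / (2 * s**2)) / (sqrt(2 * pi) * s)

def f_r_closed(u, a, s):
    """Closed form: uniform density on [-a/2, a/2]^N convolved with the kernel gradient."""
    u = np.asarray(u, dtype=float)
    n = u.size
    edge = k1(u + a / 2, s) - k1(u - a / 2, s)
    box = np.array([0.5 * (erf((ui + a / 2) / (sqrt(2) * s))
                           - erf((ui - a / 2) / (sqrt(2) * s))) for ui in u])
    return np.array([np.prod(np.delete(box, j)) * edge[j] for j in range(n)]) / a**n

def f_r_numeric_2d(u, a, s, m=400):
    """Midpoint-rule evaluation of (1/a^2) * integral of kappa'(u - x) over the square."""
    g = (np.arange(m) + 0.5) / m * a - a / 2
    x1, x2 = np.meshgrid(g, g, indexing="ij")
    d1, d2 = u[0] - x1, u[1] - x2
    # magnitude of the kernel gradient prefactor for N = 2: 1 / (2 pi s^4)
    w = np.exp(-(d1**2 + d2**2) / (2 * s**2)) / (2 * pi * s**4)
    cell = (a / m) ** 2
    return -np.array([np.sum(w * d1), np.sum(w * d2)]) * cell / a**2

a, s = 2.0, 0.5
u0 = np.array([0.8, -0.3])
closed = f_r_closed(u0, a, s)
numeric = f_r_numeric_2d(u0, a, s)
```

Near the edge of the region of support the result points back toward the interior of the hypercube, while at the center of the hypercube it vanishes by symmetry.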


REFERENCES 

Amit, D. J. (1989); Modeling Brain Function: The World of Attractor Neural
Networks, Cambridge University Press, New York.

Bishop, C. (1995); Neural Networks for Pattern Recognition, Clarendon Press,
Oxford.

Bell, A., and T. Sejnowski (1995); "An information-maximization approach to blind
separation and blind deconvolution", Neural Computation 7: 1129-1159.

Brasher, J., J. Kinser (1994); "Fractional-power synthetic discriminant functions",
Pattern Recognition 27 (4): 577-585.

Casasent, D., G. Ravichandran, and S. Bollapragada (1991); "Gaussian minimum
average correlation energy filters", Appl. Opt. 30 (35): 5176-5181.

Casasent, D., and G. Ravichandran (1992); "Advanced distortion-invariant minimum
average correlation energy (MACE) filters", Appl. Opt. 31 (8): 1109-1116.

Chiang, H-C, R. Moses, S. Ahalt, and L. Potter (1995); "Statistical properties of linear
correlators for image pattern classification with application to SAR imagery", Proceedings
of SPIE, 2490: 266-277.

Deco, G., and D. Obradovic (1996); An Information-Theoretic Approach to Neural
Computing, Springer-Verlag, New York.

Fano, R. M. (1961); Transmission of Information: A Statistical Theory of
Communication, Wiley, NY.

Figue, J., and P. Refregier (1993); "Optimality of trade-off filters", Appl. Opt. 32 (11):
1933-1935.

Fisher J., and Principe, J. C. (1994); "Formulation of the MACE filter as a linear
associative memory", Proceedings of the IEEE International Conference on Neural Networks,
5: 2934-2938.

Fisher J., and Principe, J. C. (1995a); "Experimental results using a nonlinear
extension of the minimum average correlation energy (MACE) filter", Proceedings of SPIE,
2490: 41-52.


Fisher J., and Principe, J. C. (1995b); "A nonlinear extension of the MACE filter",
Neural Networks: Special Issue on Neural Networks for Automatic Target Recognition, 8
(7): 1131-1141.

Fisher J., and Principe, J. C. (1995c); "Unsupervised learning for nonlinear synthetic
discriminant functions", Proceedings of SPIE, 2752: 1-13.

Funahashi, K. (1989); "On the approximate realization of continuous mappings by
neural networks", Neural Networks 2 (3): 183-192.

Fukunaga, K. (1990); Statistical Pattern Recognition 2nd ed., Harcourt Brace
Jovanovich, Cambridge, Massachusetts.

Gerbrands, J. (1981); "On the relationships between SVD, KLT, and PCA", Pattern
Recognition, 14: 375-381.

Gheen, G. (1990); "Design of considerations for low-clutter, distortion-invariant
correlation filters", Optical Engineering, 29 (9): 1029-1032.

Hardle, W. (1990); Applied Nonparametric Regression, Cambridge University Press,
New York.

Haykin, Simon (1994); Neural Networks A Comprehensive Foundation, IEEE Press,
Macmillan, New York.

Hebb, D. (1949); The Organization of Behavior: A Neuropsychological Theory, Wiley,
New York.

Hester, C. F., and D. Casasent (1980); "Multivariant technique for multiclass pattern
recognition", Appl. Opt. 19: 1758-1761.

Hinton, G. E., and J. A. Anderson Ed. (1981); Parallel Models of Associative Memory,
Lawrence Erlbaum Associates, New Jersey.

Hobson, A. (1969); "A new theorem of information theory", J. Stat. Phys., 3: 383-391.

Kapur, J. N., and H. K. Kesavan (1992); Entropy Optimization Principles with
Applications, Academic Press, Boston, Massachusetts.

Khinchin, I. A. (1957); Mathematical Foundations Of Information Theory, Dover
Publications, New York.

Kohonen, T. (1988); Self-Organization and Associative Memory (1st ed.), Springer
Series in Information Sciences, 8; Springer-Verlag.

Kohonen, T. (1995); Self-Organizing Maps; Springer Series in Information Sciences,
30; Springer-Verlag, New York.


Kullback, S. (1968); Information Theory and Statistics, Dover Publications, New
York.

Kullback, S., and R. Leibler (1951); "On information and sufficiency", Ann. Math.
Stat. 22: 79-86.

Kumar, B. (1986); "Minimum variance synthetic discriminant functions", J. Opt. Soc.
Am. A 3 (10): 1579-1584.

Kumar, B. (1992); "Tutorial survey of composite filter designs for optical correlators",
Appl. Opt. 31 (23): 4773-4801.

Kumar, B., Z. Bahri, and A. Mahalanobis (1988); "Constraint phase optimization in
minimum variance synthetic discriminant functions", Appl. Opt. 27 (2): 409-413.

Kumar, B., A. Mahalanobis, S. Song, S. Sims, J. Epperson (1992); "Minimum squared
error synthetic discriminant functions", Opt. Eng. 31 (5): 915-922.

Kumar, B., J. Brasher, C. Hester, G. Srinivasan, and S. Bollapragada (1994);
"Synthetic discriminant functions for recognition of images on the boundary of the convex hull
of the training set", Pattern Recognition 27 (4): 543-548.

Kumar, B. V. K., and A. Mahalanobis (1995); "Recent advances in distortion-invariant
correlation filter design", Proceedings of SPIE, 2490: 2-13.

Kung, S. Y. (1992); Digital Neural Networks, Prentice-Hall, New Jersey.

Linsker, R. (1988); "Self-organization in a perceptual system", Computer, 21:
105-117.

Linsker, R. (1990); "How to generate ordered maps by maximizing the mutual
information between input and output signals", Neural Computation, 1: 402-411.

Mahalanobis, A., B.V.K. Vijaya Kumar, and D. Casasent (1987); "Minimum average
correlation energy filters", Appl. Opt. 26 (17): 3633-3640.

Mahalanobis, A., and H. Singh (1994); "Application of correlation filters for texture
recognition", Appl. Opt. 33 (11): 2173-2179.

Mahalanobis, A., B.V.K. Vijaya Kumar, Sewoong Song, S.R.F. Sims, and J.F.
Epperson (1994); "Unconstrained correlation filters", Appl. Opt. 33 (33): 3751-3759.

Mahalanobis, A., A. V. Forman, N. Day, M. Bower, R. Cherry (1994); "Multi-class
SAR ATR using shift-invariant correlation filters", Pattern Recognition 27 (4): 619-626.


171 

Novak,  L.  M.,  M.  C.  Burl,  and  W.  W.  Irving  (1993);  "Optimal  polarimetric  processing 
for  enhanced  target  detection",  IEEE  Transactions  on  Aerospace  and  Electronic  Systems, 
29(0:234-243. 

Novak,  L.  M.,  G.  Owirka,  C.  Netishen  (1994);  "Radar  target  identification  using  spa- 
tial matched  filters",  Pattern  Recognition  27  (4):  607-617. 

Oppenheim, A. V., and R. W. Schafer (1989); Discrete-Time Signal Processing,
Prentice-Hall, New Jersey.

Papoulis,  A.  (1991),  Probability,  Random  Variables,  and  Stochastic  Processes  (3rd 
ed.),  McGraw-Hill,  New  York. 

Parzen,  E.  (1962);  "On  the  estimation  of  a  probability  density  function  and  the  mode", 
Ann.  Math.  Stat.  33:  1065-1076. 

Plumbley, M., and F. Fallside (1988); "An information-theoretic approach to
unsupervised connectionist models". Proceedings of the 1988 Connectionist Models Summer
School.  D.  Touretzky,  G.  Hinton,  and  T.  Sejnowski,  eds.,  Morgan  Kaufmann,  San  Mateo, 
CA,  239-245. 

Ravichandran,  G.,  and  D.  Casasent  (1992);  "Minimum  noise  and  correlation  energy 
filters", Appl. Opt. 31 (11): 1823-1833.

Refregier,  Ph.  (1991);  "Filter  design  for  optical  pattern  recognition:  multicriteria  opti- 
mization approach".  Opt.  Lett.  15(15):  854-856. 

Refregier,  Ph.,  and  J.  Figue  (1991);  "Optimal  trade-off  filter  for  pattern  recognition 
and their comparison with Wiener approach". Opt. Comp. Proc. 1: 3-10.

Richard, M., and R. Lippmann (1991); "Neural network classifiers estimate Bayesian a
posteriori  probabilities".  Neural  Computation  3:461-483. 

Rosen,  J.  (1993);  "Learning  in  correlators  based  on  projection  onto  constraint  sets". 
Optics Letters, 18 (14): 1183-1185.

Rosenblatt,  F.  (1958);  "The  perceptron:  a  probabilistic  model  for  information  storage 
and organization in the brain". Psychological Review, 65: 386-408.

Rumelhart,  D.,  G.  Hinton,  R.  Williams  (1986);  "Learning  internal  representations  by 
error backpropagation". Parallel Distributed Processing: Explorations in the
Microstructure of Cognition (D. Rumelhart and J. McClelland, eds.), 1: 322-328, MIT Press,
Massachusetts.

Scharf,  Louis  L.  (1991);  Statistical  Signal  Processing:  Detection,  Estimation,  and 
Time  Series  Analysis,  Addison-Wesley  Publishing  Company,  New  York. 



Schmidt,  W.,  and  J.  Davis  (1993);  "Pattern  recognition  properties  of  various  feature 
spaces  for  higher  order  neural  networks",  IEEE  Transactions  on  Pattern  Analysis  and 
Machine  Intelligence,  15  (8):  795-801. 

Shannon,  C.  E.  (1948);  "A  mathematical  theory  of  communications",  Bell  Systems 
Technical Journal, 27: 379-423.

Sudharsanan,  S.  I.,  A.  Mahalanobis,  and  M.  K.  Sundareshan  (1990);  "Selection  of 
optimum  output  correlation  values  in  synthetic  discriminant  function  design",  J.  Opt.  Soc. 
Am.  A,  7  (4):  611-616. 

Sudharsanan,  S.  I.,  A.  Mahalanobis,  and  M.  K.  Sundareshan  (1991);  "A  unified  frame- 
work for  the  synthesis  of  synthetic  discriminant  functions  with  reduced  noise  variance  and 
sharp  correlation  structure",  Appl.  Opt.  30  (35):  5176-5181. 

Vander  Lugt,  A.  (1964);  "Signal  detection  by  complex  matched  spatial  filtering", 
IEEE  Trans.  Inf.  Theory.  10  (23):  139. 

Viola, P., N. Schraudolph, and T. Sejnowski (1996); "Empirical entropy manipulation
for  real-world  problems".  Neural  Information  Processing  Systems  8,  to  appear  in  pub- 
lished proceedings. 

Werbos,  P.  (1974);  "Beyond  regression;  new  tools  for  prediction  and  analysis  in  the 
behavioral  sciences".  Ph.  D.  Thesis,  Harvard  University,  Cambridge,  MA. 

Widrow,  B.,  and  M.  Hoff  (1960);  "Adaptive  switching  circuits",  IRE  Wescon  Conven- 
tion Record,  96-104. 

Wilkinson,  T.  and  J.  Goodman  (1991);  "Synthetic  discriminants  and  eigenvector 
decompositions",  Appl.  Opt.  30  (23):  3278-3280. 

Wong, B., and I. Blake (1994); "Detection in multivariate non-gaussian noise", IEEE
Transactions on Communications, 42.


BIOGRAPHICAL  SKETCH 

Mr. Fisher was born April 13, 1965. He earned his bachelor's degree in electrical
engineering from the University of Florida in 1987. He was a graduate research assistant in the
Electronic  Communications  Laboratory  at  the  University  of  Florida  from  1987  until  1990, 
during  which  time  he  earned  his  Master  of  Engineering  degree  from  the  University  of 
Florida.  He  has  continued  his  affiliation  with  the  ECL,  as  both  a  faculty  member  and 
graduate  research  assistant,  since  1990,  during  which  time  he  has  conducted  research  in 
the  areas  of  ultra-wideband  radar  for  ground  penetration  and  foliage  penetration  applica- 
tions, radar  signal  processing,  and  automatic  target  recognition  algorithms.  He  has  also 
performed  duties  as  a  graduate  research  assistant  and  Ph.  D.  candidate  in  the  Computa- 
tional NeuroEngineering  Laboratory,  during  which  time  he  has  conducted  research  (Ph.D. 
topic)  on  nonlinear  extensions  to  synthetic  discriminant  functions  with  application  to  clas- 
sification of  mm-wave  SAR  imagery. 




I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  standards  of 
scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of
Doctor  of  Philosophy. 


I certify that I have read this study and that in my opinion it conforms to acceptable standards of
scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a  dissertation  for  the  degree  of 
Doctor  of  Philosophy. 

Thomas  E.  Bullock 

Professor of Electrical and Computer Engineering

I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  standards  of 
scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a  dissertation  for  the  degree  of 
Doctor  of  Philosophy. 




John  M.  M.  Anderson 

Assistant  Professor  of  Electrical  and  Computer  Engineering 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  standards  of 
scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a  dissertation  for  the  degree  of 
Doctor  of  Philosophy. 




Assistant Professor of Electrical and Computer Engineering


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  standards  of 
scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a  dissertation  for  the  degree  of 
Doctor  of  Philosophy. 


Frank  J.  Bova 

Professor  of  Nuclear  and  Radiological  Engineering 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  standards  of 
scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a  dissertation  for  the  degree  of 
Doctor  of  Philosophy. 




Andrew  F.  Laine 

Associate  Professor  of  Computer  and  Information 
Science  and  Engineering 


This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  College  of  Engineering  and  to  the 
Graduate  School  and  was  accepted  as  partial  fulfillment  of  the  requirements  for  the  degree  of 
Doctor  of  Philosophy. 


May  1997 




Winfred  M.  Phillips 

Dean,  College  of  Engineering 


Karen  A.  Holbrook 
Dean,  Graduate  School 




